Nonparametric  Bayesian  Context  Learning  for 
Buried  Threat  Detection 


by 

Christopher  Ralph  Ratto 

Department  of  Electrical  and  Computer  Engineering 
Duke  University 

Date:  _ 

Approved: 

Leslie  M.  Collins,  Supervisor 

Loren  W.  Nolte 

Jeffrey  L.  Krolik 

Qing  H.  Liu 

David  L.  Banks 

Dissertation  submitted  in  partial  fulfillment  of  the  requirements  for  the  degree  of 
Doctor  of  Philosophy  in  the  Department  of  Electrical  and  Computer  Engineering 
in  the  Graduate  School  of  Duke  University 
2012 


Report  Documentation  Page 


1.  REPORT  DATE 

2012

2.  REPORT  TYPE

3.  DATES  COVERED 

00-00-2012  to  00-00-2012 

4.  TITLE  AND  SUBTITLE 

Nonparametric  Bayesian  Context  Learning  for  Buried  Threat  Detection 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROJECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Duke  University, Department  of  Electrical  and  Computer 

Engineering, Durham, NC, 27708 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 


14.  ABSTRACT 

This dissertation addresses the problem of detecting buried explosive threats (i.e., landmines and
improvised explosive devices) with ground-penetrating radar (GPR) and hyperspectral imaging (HSI)
across widely-varying environmental conditions. Automated detection of buried objects with GPR and HSI
is particularly difficult due to the sensitivity of sensor phenomenology to variations in local environmental
conditions. Past approaches have attempted to mitigate the effects of ambient factors by designing statistical
detection and classification algorithms to be invariant to such conditions. These methods have generally
taken the approach of extracting features that exploit the physics of a particular sensor to provide a
low-dimensional representation of the raw data for characterizing targets from non-targets. A statistical
classification rule is then usually applied to the features. However, it may be difficult for feature extraction
techniques to adapt to the highly nonlinear effects of near-surface environmental conditions on sensor
phenomenology, as well as to retrain the classifier for use under new conditions. Furthermore, the search
for an invariant set of features ignores the possibility that one approach may yield best performance
under one set of terrain conditions (e.g., “dry”), and another might be better for another set of conditions
(e.g., “wet”). An alternative approach to improving detection performance is to consider exploiting
differences in sensor behavior across environments rather than mitigating them, and treat changes in the
background data as a possible source of supplemental information for the task of classifying targets and
non-targets. This approach is referred to as context-dependent learning. Although past researchers have
proposed context-based approaches to detection and decision fusion, the definition of context used in this
work differs from those used in the past. In this work, context is motivated by the physical state of the world
from which an observation is made, and not from properties of the observation itself. The proposed
context-dependent learning technique therefore utilized additional features that characterize soil
properties from the sensor background, and a variety of nonparametric models were proposed for
clustering these features into individual contexts. The number of contexts was assumed to be unknown a
priori and was learned via Bayesian inference using Dirichlet process priors. The learned contextual
information was then exploited by an ensemble of classifiers trained for classifying targets in each of the
learned contexts. For GPR applications the classifiers were trained for performing algorithm fusion. For
HSI applications the classifiers were trained for performing band selection. The detection


15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

Same  as 
Report  (SAR) 

18.  NUMBER 
OF  PAGES 

331 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


Abstract 


Nonparametric  Bayesian  Context  Learning  for  Buried  Threat 

Detection 

by 

Christopher  Ralph  Ratto 

Department  of  Electrical  and  Computer  Engineering 
Duke  University 

Date:  _ 

Approved: 


Leslie  M.  Collins,  Supervisor 


Loren  W.  Nolte 


Jeffrey  L.  Krolik 


Qing  H.  Liu 


David  L.  Banks 

An  abstract  of  a  dissertation  submitted  in  partial  fulfillment  of  the  requirements  for 
the  degree  of  Doctor  of  Philosophy  in  the  Department  of  Electrical  and  Computer 

Engineering 

in  the  Graduate  School  of  Duke  University 
2012 


Copyright  ©  2012  by  Christopher  Ralph  Ratto 
All  rights  reserved 


Abstract 


This dissertation addresses the problem of detecting buried explosive threats (i.e.,
landmines and improvised explosive devices) with ground-penetrating radar (GPR)
and hyperspectral imaging (HSI) across widely-varying environmental conditions.
Automated detection of buried objects with GPR and HSI is particularly difficult
due to the sensitivity of sensor phenomenology to variations in local environmental
conditions. Past approaches have attempted to mitigate the effects of ambient factors
by designing statistical detection and classification algorithms to be invariant
to such conditions. These methods have generally taken the approach of extracting
features that exploit the physics of a particular sensor to provide a low-dimensional
representation of the raw data for characterizing targets from non-targets. A statistical
classification rule is then usually applied to the features. However, it may
be difficult for feature extraction techniques to adapt to the highly nonlinear effects
of near-surface environmental conditions on sensor phenomenology, as well as to retrain
the classifier for use under new conditions. Furthermore, the search for an
invariant set of features ignores the possibility that one approach may yield best
performance under one set of terrain conditions (e.g., “dry”), and another might be
better for another set of conditions (e.g., “wet”).

An alternative approach to improving detection performance is to consider exploiting
differences in sensor behavior across environments rather than mitigating
them, and treat changes in the background data as a possible source of supplemental
information for the task of classifying targets and non-targets. This approach is
referred to as context-dependent learning.

Although past researchers have proposed context-based approaches to detection
and decision fusion, the definition of context used in this work differs from those
used in the past. In this work, context is motivated by the physical state of the
world from which an observation is made, and not from properties of the observation
itself. The proposed context-dependent learning technique therefore utilized
additional features that characterize soil properties from the sensor background, and
a variety of nonparametric models were proposed for clustering these features into
individual contexts. The number of contexts was assumed to be unknown a priori,
and was learned via Bayesian inference using Dirichlet process priors.

The learned contextual information was then exploited by an ensemble of classifiers
trained for classifying targets in each of the learned contexts. For GPR applications,
the classifiers were trained for performing algorithm fusion. For HSI applications,
the classifiers were trained for performing band selection. The detection
performance of all proposed methods was evaluated on data from U.S. government
test sites. Performance was compared to several algorithms from the recent literature,
several of which have been deployed in fielded systems. Experimental results illustrate
the potential for context-dependent learning to improve detection performance of
GPR and HSI across varying environments.


Introduction


1.1  Landmine  and  IED  Detection 

Detection  and  remediation  of  buried  explosives  is  a  serious  problem  faced  by  military 
and  civilian  personnel  around  the  world.  Historically,  this  threat  has  taken  the  form 
of  anti-tank  (AT)  and  anti-personnel  (AP)  landmines,  which  are  typically  emplaced 
en  masse  over  a  wide  area  as  a  strategic  barrier  to  prevent  enemy  advances.  The 
use  of  landmines  in  armed  conflict  often  results  in  a  severe  humanitarian  problem 
once  fighting  has  ended,  as  the  majority  of  casualties  of  landmine  detonations  in 
post-conflict  regions  tend  to  be  civilians.  According  to  the  International  Campaign 
to  Ban  Landmines,  civilians  made  up  approximately  70%  of  the  3,531  worldwide 
casualties  due  to  landmines  and  unexploded  ordnance  in  2009,  and  children  made 
up  almost  a  third  of  all  casualties  for  whom  the  age  was  known  [1].

Over the past decade, a new threat has emerged with the proliferation of improvised
explosive devices (IEDs), which the United States Department of Defense
reports as the leading cause of casualties to American soldiers in Iraq and Afghanistan
[2]. Unlike landmines, IEDs by definition are not systematically manufactured and
vary widely in the explosive compounds, containers, and detonation mechanisms used
in their construction. Often, the main charge of an IED is composed of a fertilizer
such as ammonium nitrate and a solid fuel such as aluminum or sugar, and containers
tend to be common items such as plastic jugs, buckets, or metal cooking pots [3]. A
recent study has found that although the total number of worldwide casualties from
victim-activated explosives (including landmines, IEDs, and unexploded ordnance)
decreased from 5,426 in 2007 to 3,956 in 2009, victim-activated IED casualties
increased both in absolute terms (80 in 2008 to 549 in 2009) and as a percentage of all
attacks (3% in 2008 to 18% in 2009) [4]. Over half of these casualties have occurred
in Afghanistan (accounting for 20% of total casualties in that country), with other
countries reporting anti-personnel IED casualties including Cambodia, the Democratic
Republic of the Congo, India, Iraq, Nepal, Pakistan, Peru, Colombia, Burma,
and Turkey.

In landmine and IED detection, as in many other detection problems, the ultimate
goal is to robustly and accurately identify objects of interest with as few false alarms
as possible. This trade-off can be expressed in terms of probability of detection
(PD) and either probability of false alarm (PF) or false alarm rate (FAR), with
the latter usually measured in units of false alarms per square meter (FA/m²). The
obvious risks faced by humanitarian deminers or military route clearance patrols
make landmine and IED remediation very costly and time-consuming. It has been
estimated that while it may only cost a few dollars to manufacture and emplace a
single landmine, the cost of safely removing and neutralizing it can run from several
hundred to one thousand dollars [5]. Therefore, the trade-off between detection and
false alarm rate can also be seen as a trade-off between safety and cost. Humanitarian
deminers may require a PD of 1 at the lowest FAR possible [6], while military route
clearance patrols may stress the importance of maintaining a constant rate of advance
through a potentially-threatening area and may be content with a PD as low as
0.90 [7].
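
The PD/FAR trade-off described above can be made concrete with a short computation. The following Python sketch is illustrative only and is not taken from the algorithms evaluated in this dissertation. It sweeps a decision threshold over a list of alarm confidences and reports PD against FAR in FA/m². The function names, the scoring convention (higher confidence means more target-like), and the simplifying assumption that every emplaced target produced exactly one alarm are hypothetical.

```python
import numpy as np

def roc_pd_far(scores, is_target, area_m2):
    """Return (FAR, PD) arrays obtained by sweeping a threshold over the scores.

    scores    : detector confidence for each alarm (higher = more target-like)
    is_target : True where the alarm coincides with an emplaced target
    area_m2   : total area interrogated by the sensor, in square meters

    Assumption: every target produced exactly one alarm, so PD is normalized
    by the number of alarms labeled as targets (which must be nonzero).
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    order = np.argsort(-scores)                  # decreasing confidence
    hits = np.cumsum(is_target[order])           # targets declared so far
    false_alarms = np.cumsum(~is_target[order])  # non-targets declared so far
    pd = hits / is_target.sum()                  # probability of detection
    far = false_alarms / area_m2                 # false alarms per square meter
    return far, pd

def far_at_pd(far, pd, required_pd):
    """Lowest FAR at which the required PD is reached (NaN if never reached)."""
    reached = pd >= required_pd
    return far[reached].min() if reached.any() else float("nan")
```

Under these assumptions, a humanitarian requirement would be read off as `far_at_pd(far, pd, 1.0)`, while a route clearance requirement might use `far_at_pd(far, pd, 0.90)`.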

Currently, a major focus of military research is improving the detection robustness
of counter-mine/IED platforms used in Afghanistan. Afghanistan is notorious for its
difficult terrain: the South and West are characterized by the Registan Desert and
Sistan Basin (one of the driest places on Earth), while the North and East include
the Hindu Kush and Pamir mountains [9]. Within the desert and mountainous
regions, the geology is highly variable, even within single provinces [8]. The climate
of Afghanistan also varies regionally, with the Southwest portion of the country being
considerably drier than the Northeast, where mountain snowfall contributes to wetter
conditions at lower elevations.

The impact of varying terrain and weather conditions on the performance of
counter-mine/IED sensors is tremendous. This dissertation primarily focuses on
algorithms for detecting buried threats with ground-penetrating radar (GPR). Although
GPR has long been used in a variety of applications, its effectiveness in
landmine detection has been highlighted in much of the research literature over the
past decade. However, the unique signal processing challenges presented by varying
environmental factors must be considered. The following section introduces the
phenomenology of GPR, its sensitivity to various environmental factors, and past
approaches to improve detection performance.

1.2  Ground-Penetrating  Radar

1.2.1  Background 

GPR operates by transmitting an electromagnetic signal (e.g., a differentiated Gaussian
pulse) into the ground and measuring the reflections of the signal at subsurface
dielectric interfaces in either the temporal or frequency domain. The versatility
of GPR is best illustrated by its wide range of applications, which include geophysics,
forensics, utilities, and archeology [10]. Over the past two decades, GPR
has emerged as a complementary alternative to electromagnetic induction (EMI)
sensors (i.e., “metal detectors”) for the next generation of landmine detection systems
[11-13]. Metal detectors have historically performed very poorly in detecting
nonmetal targets because they rely on inducing currents in buried conductors and
sensing the resulting magnetic field. Therefore, it may be difficult to detect targets
such as plastic, ceramic, or wood landmines or IEDs using an EMI sensor. GPR can
potentially be used to detect any type of buried object, as long as its dielectric
properties contrast sufficiently with the surrounding soil to reflect the transmitted signal.

GPRs used in buried threat detection tend to be wide-band systems with a frequency
range and spatio-temporal sampling rates much higher than those used in
most geophysical applications. For example, the GPR used in the Husky Mounted
Detection System (HMDS) manufactured by NIITEK, Inc. (shown in Figure 1.1)
transmits a differentiated Gaussian GPR signal with a bandwidth spanning 200 MHz to 7
GHz, and time-gates the received reflections at 6.6 ns (which corresponds to approximately
1 m of range in air) [14]. The received time-domain signal is referred to as an A-scan.
An example of a GPR A-scan collected over an anti-tank landmine is shown in Figure
1.2. The first received pulse is the reflection from the ground surface, referred to
as ground bounce, and is typically of high magnitude. After the ground bounce, the
reflection from the target is received; it is generally of lesser magnitude and may
be embedded in clutter corresponding to reflections between subsurface layers.
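
The correspondence between the 6.6 ns time gate and roughly 1 m of range in air follows from the two-way travel of the pulse. The short Python sketch below checks that arithmetic and shows how the same gate reaches a shorter distance in a medium with higher relative permittivity. The function name and the soil permittivity of 4 are hypothetical values chosen only for illustration, and the medium is assumed to be lossless and homogeneous.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def max_range_m(time_gate_s, rel_permittivity=1.0):
    """Maximum one-way range reached within a two-way time gate."""
    velocity = C / rel_permittivity ** 0.5   # propagation speed in the medium
    return velocity * time_gate_s / 2.0      # halve: the pulse travels out and back

print(round(max_range_m(6.6e-9), 2))         # air: prints 0.99
print(round(max_range_m(6.6e-9, 4.0), 2))    # hypothetical soil, eps_r = 4: prints 0.49
```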

In vehicular GPR systems such as the HMDS, A-scans may be collected at multiple
spatial locations to form a two-dimensional “image” of the subsurface that is
referred to as a B-scan. A B-scan may illustrate the signals received from each channel
across the array (the crosstrack direction), or at locations corresponding to the
direction of vehicle motion (the downtrack direction). An example of a GPR B-scan
collected over the same anti-tank landmine is shown in Figure 1.3. The B-scan allows
for visual interpretation of the relative locations of the ground, subsurface layer, and
target over a given area. Note that the landmine signature has a distinctive hyperbolic
shape as the sensor approaches and passes over the target. This distinctive
property of GPR phenomenology is exploited by many statistical pattern recognition
algorithms, which will be discussed later.

Figure 1.1: The NIITEK Husky Mounted Detection System (HMDS), which consists
of 4 GPR antenna array panels (each with 12 channels) mounted in front of a
Husky route clearance vehicle [14].

Figure 1.2: An example of a GPR A-scan collected over an anti-tank landmine
buried under a paved road. The horizontal axis represents time (in samples) and
the vertical axis represents the amplitude of the received signal. Received pulses
corresponding to the ground-bounce, subsurface layering, and the target itself are
marked.

Figure 1.3: An example of a GPR B-scan collected over an anti-tank landmine
buried under a paved road. The horizontal axis represents downtrack position (in
samples), the vertical axis represents time (in samples), and the amplitude of the
received signal corresponds to pixel color. Received pulses corresponding to the
ground-bounce, subsurface layering, and the target itself are marked.
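
The hyperbolic shape of a target signature in a downtrack B-scan follows from simple geometry: as the antenna moves, the slant range to the target first shrinks and then grows. The Python sketch below illustrates this under strongly simplified assumptions that are not part of the original text, namely a point target, a monostatic antenna at the ground surface, and a homogeneous soil with constant propagation velocity. The function name and all numerical values are hypothetical.

```python
import numpy as np

def hyperbola_arrival_times(x, x0, depth, velocity):
    """Two-way travel time to a point scatterer at (x0, depth) from antenna positions x."""
    slant_range = np.sqrt(depth ** 2 + (np.asarray(x, dtype=float) - x0) ** 2)
    return 2.0 * slant_range / velocity

x = np.linspace(-0.5, 0.5, 11)   # downtrack antenna positions, meters
t = hyperbola_arrival_times(x, x0=0.0, depth=0.1, velocity=1.5e8)
apex = x[np.argmin(t)]           # earliest arrival occurs directly over the target
```

In this idealized model the arrival times satisfy (v t / 2)² - (x - x0)² = depth², which is the equation of a hyperbola whose apex marks the downtrack position of the target.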

The frequency range, lack of significant self-signature artifacts, and high spatial
and temporal sampling rates of the NIITEK GPR have made it an attractive choice
for high-resolution subsurface imaging. The great amount of detail in a target’s
GPR signature can potentially allow for inference of its geometry, composition, and
inner structure [15]. Figure 1.4 illustrates the GPR signatures of four different anti-tank
landmines, two high-metal and two low-metal types, buried at the same depth
in a dirt road. The signatures of the metallic targets are higher in energy, since
the metal casings reflect the incident GPR pulse almost perfectly. When several A-scans
are collected over the target, the resulting B-scan illustrates a single hyperbolic
target signature. While the plastic targets’ signatures are lower in energy, they are
characterized by multiple reflections that occur within the landmine itself. Therefore,
the signatures of plastic targets are made up of multiple hyperbolas decreasing in
energy with time.

Figure 1.4: Example GPR B-scans illustrating the signatures of different anti-tank
landmine types. The top two B-scans illustrate signatures of landmines with high
metal content, and the bottom two B-scans illustrate signatures of landmines with
low metal content.

GPR signatures are rich in information about the shape, size, and composition of
a buried target. Therefore, GPR data have been shown to be well suited to statistical
pattern recognition algorithms that differentiate between responses from targets and
non-threatening clutter, including natural and artificial debris, rocks, roots, and
empty holes. However, a significant challenge is encountered when classifying GPR
signatures collected across widely-varying environmental conditions, such as different
soil types or moisture levels. The effects of these environmental factors on GPR have
been studied extensively, and the body of research in this area is summarized in the
following subsection.


1.2.2  Environmental  Effects  on  GPR  Sensing 

The signals generated and sensed by GPR are very sensitive to fluctuations in environmental
conditions because, unlike the fields used by metal detectors, GPR signals interact with
virtually everything present in the local environment. A large body of research has
investigated the effects of various environmental factors on the performance of GPR
in landmine detection applications. In particular, researchers have focused on the
effects of soil dielectric properties (i.e., electrical permittivity and conductivity),
heterogeneity, and surface texture.

Permittivity is a property of soil that partially governs the speed at which electromagnetic
waves propagate through it. It is a function of various physical properties
of the soil, including grain size and composition as well as moisture content [10]. Often,
a material’s permittivity is expressed in terms of its value relative to that of free
space ($\epsilon_0 = 8.85 \times 10^{-12}$ F/m) through its relative permittivity or dielectric constant,
$\epsilon_r$. A seminal paper by Topp et al. focused on the effect of increased moisture on
the dielectric constant of soils, and illustrated that a polynomial relationship exists
between dielectric constant and volumetric soil water content [16]. Later investigations
by Miller et al. illustrated that the effect of soil moisture on conductivity
is also nonlinear, exhibiting a logarithmic relationship in which increasing moisture
generally increases conductivity up to a saturation level [17,18].
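
As an illustration of the polynomial relationship between moisture and dielectric constant, the Python sketch below evaluates the third-order empirical fit that is commonly quoted from Topp et al. The coefficients are those widely reproduced in the literature and should be verified against [16] before use; the function name and the sample moisture values are hypothetical.

```python
def topp_dielectric_constant(theta_v):
    """Empirical relative permittivity of soil from volumetric water content (m^3/m^3).

    Coefficients as commonly quoted for the Topp et al. fit; treat them as an
    assumption to be checked against the original reference.
    """
    return 3.03 + 9.3 * theta_v + 146.0 * theta_v ** 2 - 76.7 * theta_v ** 3

# Dielectric constant for a dry, a moderately moist, and a wet soil (illustrative values)
for theta in (0.05, 0.15, 0.30):
    print(f"theta_v = {theta:.2f}  ->  eps_r = {topp_dielectric_constant(theta):.1f}")
```

Under this fit the dielectric constant rises from roughly 4 for a nearly dry soil to roughly 17 at a volumetric water content of 0.30, which shows how strongly moisture can change the propagation conditions seen by the radar.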

Permittivity and conductivity affect GPR signals in many ways. The greatest
effect is due to dielectric contrast between the target and surrounding soil. If the
contrast between the two materials’ dielectric properties is large, waves will reflect off
the target with greater magnitude than if their dielectric properties were similar.
Borchers et al. illustrated that in many cases, increasing soil moisture also increases
this dielectric contrast to yield target signatures with higher magnitude [19]. Furthermore,
soils with higher dielectric constants will force GPR pulses to propagate
more slowly through them. Miller et al. demonstrated this effect, in which the GPR
response of a target appeared later in time in soils with high dielectric constant and
could easily be confused with the response of a deeper target buried in a soil with low
dielectric constant [17,18]. Electrical conductivity governs the rate at which propagating
electromagnetic waves are attenuated due to heat dissipation. Soils with high
conductivity dissipate propagating waves faster than soils with low conductivity,
and greatly diminish the amplitudes of GPR responses. Takahashi et al. suggested
that the effects of increased conductivity on the fidelity of target signatures
are only noticeable for high values, on the order of 0.1 S/m [20].
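
The three effects described above (reflection strength, delay, and attenuation) can be approximated with standard plane-wave expressions. The Python sketch below is a simplified illustration, not a model used in this dissertation: it assumes normal incidence, non-magnetic materials, a planar soil/target interface, and the low-loss approximation for attenuation (conductivity much smaller than the product of angular frequency and permittivity). All function names are hypothetical.

```python
import math

EPS0 = 8.854e-12          # permittivity of free space, F/m
MU0 = 4e-7 * math.pi      # permeability of free space, H/m
C = 299_792_458.0         # speed of light in vacuum, m/s

def reflection_coefficient(eps_soil, eps_target):
    """Normal-incidence amplitude reflection coefficient at a soil/target interface."""
    n1, n2 = math.sqrt(eps_soil), math.sqrt(eps_target)
    return (n1 - n2) / (n1 + n2)

def two_way_delay_s(depth_m, eps_soil):
    """Round-trip travel time to a target buried at the given depth."""
    return 2.0 * depth_m * math.sqrt(eps_soil) / C

def attenuation_db_per_m(sigma, eps_soil):
    """Low-loss approximation of attenuation for conductivity sigma (S/m)."""
    alpha_np = 0.5 * sigma * math.sqrt(MU0 / (EPS0 * eps_soil))  # nepers per meter
    return 8.686 * alpha_np                                      # convert to dB per meter
```

For example, with these expressions a soil of relative permittivity 9 and conductivity 0.1 S/m attenuates the wave by roughly 55 dB per meter of one-way travel, which is consistent with the observation that conductivities of this order noticeably degrade target signatures.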

Figure 1.5 illustrates the effect of increasing soil moisture on the GPR signature
of another low-metal, anti-tank landmine. Each of the three B-scans corresponds to
a different moisture scenario: the left plot corresponds to dry conditions (more than 5
days since the last rainfall), the center plot corresponds to moderate conditions (3-5
days since the last rainfall), and the right plot corresponds to wet conditions (less
than 3 days since the last rainfall). Note how the target’s hyperbolic signature both
decreases in energy and appears later in time as moisture increases. This is due to the
combined effects of moisture on soil permittivity and conductivity. Increased moisture
decreases dielectric contrast between the target and surrounding soil, while also
increasing attenuation. As a result, the target’s GPR response decreases
in magnitude. Increased moisture also decreases the propagation speed, causing the
response to appear later in time.

Subsurface heterogeneity is another major factor impacting the performance of
GPR sensors. Soils are naturally heterogeneous, composed of a mixture of organic
and non-organic matter, and reflections of GPR pulses from heterogeneities can
yield significant amounts of clutter in GPR signals. Types of natural heterogeneity
include buried rocks, roots, and animal burrows, as well as stratifications in soil moisture,
density, or composition. The effects of heterogeneity on the performance of GPR in
subsurface target detection have generally been studied in controlled experiments
conducted with electromagnetic simulations. In a study by Gürel and Oğuz [22], heterogeneities were
approximated by random subsurface scatterers and were varied in quantity, size, and
shape. These experiments demonstrated that in very heterogeneous soils, scattering
from the individual heterogeneities can severely degrade the GPR signature of the
primary target via destructive interference. In these scenarios, visual target detection
becomes increasingly difficult and automated techniques yield high false alarm rates.

Figure 1.5: GPR B-scans of a low-metal, anti-tank landmine buried at 3 inches
under different moisture conditions. Left: dry conditions, i.e., greater than 5 days
since the last rainfall; Center: moderate conditions, i.e., 3-5 days since the
last rainfall; Right: wet conditions, i.e., less than 3 days since the last rainfall. [21]

In landmine detection applications which concern primary and secondary roads,
it is also important to consider the effects of road construction. The presence of
bumps, potholes, or obstructions in a road can cause the GPR array to bounce vertically,
and depending on the displacement of the antenna, significant propagation
losses can be induced along with distortion of the hyperbolic shape that characterizes
a target signature, as presented by Milner [23]. Furthermore, elements of the
road surface such as gravel, asphalt, and concrete layers can also yield significant
clutter in a manner similar to soil inhomogeneities. These effects become even more
pronounced when the surfaces are rough. A variety of simulated GPR experiments
have been performed to determine the feasibility of subsurface target detection in
the presence of rough surface [24-27] and subsurface [28] interfaces. In these studies,
rough surfaces were generally simulated by a stochastic process with a Gaussian
spectrum, parameterized by its variance and correlation length. It has generally
been found that variations of both parameters impact the GPR responses of targets,
with variance dominating the overall effect on arrival time and correlation length
impacting distortion of the signature’s hyperbolic shape. Inclusion of rough subsurface
layers (e.g., the asphalt/concrete or concrete/soil interfaces) in the detection
scenario further compounds these effects.
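
A rough surface of the kind used in such simulation studies can be drawn by shaping white noise so that its spectrum is Gaussian. The Python sketch below is a minimal one-dimensional illustration and is not the procedure of any particular cited study. It assumes a periodic profile (a consequence of using the FFT) and a target autocorrelation of the form h² exp(-(lag / l)²), where h is the RMS height (so h² is the variance) and l is the correlation length. The function name and parameter names are hypothetical.

```python
import numpy as np

def gaussian_rough_surface(n_points, dx, rms_height, corr_length, rng=None):
    """Draw one 1-D height profile with an approximately Gaussian correlation function.

    The target autocorrelation is rms_height**2 * exp(-(lag / corr_length)**2).
    """
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n_points)                 # uncorrelated heights
    k = 2.0 * np.pi * np.fft.rfftfreq(n_points, d=dx)     # spatial wavenumbers, rad/m
    shaping = np.exp(-(k * corr_length) ** 2 / 8.0)       # square root of the Gaussian spectrum
    profile = np.fft.irfft(np.fft.rfft(white) * shaping, n=n_points)
    profile -= profile.mean()                             # zero-mean surface
    return profile * (rms_height / profile.std())         # rescale to the requested RMS height
```

Increasing `rms_height` in this sketch raises the vertical excursions of the surface, while increasing `corr_length` makes the profile vary more slowly along the lane, which mirrors the two parameters discussed above.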

Figure  1.6  illustrates  B-scans  containing  the  GPR  signature  of  the  same  low- 
metal,  anti-tank  landmine  buried  at  the  same  depth  in  four  different  types  of  road 
construction:  dirt,  gravel,  asphalt,  and  concrete.  It  can  be  seen  in  the  dirt  and  gravel 
B-scans  that  the  target’s  signature  is  surrounded  by  responses  from  other  subsurface 
heterogeneities.  These  could  be  rocks  or  local  differences  in  soil  density  or  moisture. 
The  ground  bounce  also  illustrates  the  effects  rough  surface  scattering,  with  several 
“blobs”  of  high-energy  reflections  occurring  immediately  after  the  primary  ground 
reflection.  The  asphalt  lane  exhibits  an  intermediate  layer,  which  is  characterized  by 
a  reflection  at  its  top  and  bottom  interfaces  that  appears  to  be  of  similar  magnitude 
to  the  target  response.  The  asphalt  surface  is  also  smoother  than  the  dirt  and 
gravel,  as  illustrated  by  the  ground  bounce.  Finally,  concrete  appears  to  be  the  most 
homogeneous  type  of  lane.  The  target  signature  stands  out,  and  is  not  surrounded 
by  any  secondary  signatures  from  subsurface  clutter.  The  ground  bounce  is  like  that 
for  the  asphalt  lane,  since  the  surface  is  paved  and  therefore  smoother  than  dirt 
or gravel. The concrete layer also appears either to have little dielectric contrast
with the soil below it, or to have caused the GPR pulse to propagate so slowly that it
did not reach the soil layer in the allotted time, since a distinct reflection from the
concrete/soil interface is not visible at the same scale as other reflections.

Figure 1.6: GPR B-scans of a low-metal, anti-tank landmine buried in four different
types of road construction: dirt (left), gravel (center-left), asphalt (center-right), and
concrete (right).

1.2.3  Buried  Threat  Detection  with  GPR  in  Changing  Environmental  Conditions 

Due  to  the  tremendous  impact  that  varying  environmental  conditions  have  on  GPR 
signatures of buried targets, much research has focused on the task of robust automated
detection and discrimination. These approaches mostly fall under two general
categories. The first consists of techniques based upon electromagnetic theory,
which utilize model inversion strategies to decouple the interactions of GPR
signals with the target from environmental artifacts. In
contrast,  the  second  category  consists  of  statistical  methods,  which  are  based  upon 
adaptive  signal  processing,  pattern  recognition,  and  machine  learning  theory. 

Inversion  Approaches  to  Target  Detection  with  GPR 

The first major category of approaches to target detection with GPR involves inverse
solutions to Maxwell’s equations via rigorous scattering models. The aim of these
approaches is to explicitly model the environment’s response to GPR and decouple
it  from  the  response  of  the  target.  After  recovering  the  basic  GPR  signature  of  the 
target,  visual  confirmation  or  a  simple  detector  can  be  used  to  determine  whether  a 
target  is  present. 

Several electromagnetic model inversion techniques have been proposed for mitigating
the effects of antenna reverberation [29-31], rough surface scattering [27], and
lossy/moist soils [32-34]. Prototype GPR signatures, collected either in a laboratory
or in a controlled field campaign, are usually employed as a target model. When
data  is  collected  in  the  field,  a  deconvolution  technique  is  applied  to  the  GPR  signals 
for  isolating  the  target  signature  from  the  environmental  artifacts.  Inversion  tech¬ 
niques  have  been  shown  to  quantitatively  estimate  various  environmental  parameters 
(e.g.,  soil  permittivity  and  conductivity)  in  addition  to  several  aspects  of  the  target’s 
geometry,  including  its  location  and  burial  depth. 

However, applying closed-form model inversions to GPR data poses several implementation
difficulties that must be considered. The greatest shortfall of inverse
modeling  lies  in  the  time  needed  to  compute  these  solutions;  subsurface  threat  de¬ 
tection  is  already  an  arduous  and  time-consuming  task,  and  improvements  in  tech¬ 
nology  should  not  impose  any  additional  time  expense  onto  deminers.  Furthermore, 
vehicular  route  clearance  platforms  are  required  to  operate  at  a  constant  rate  of  ad¬ 
vance,  and  therefore  all  on-board  algorithms  must  operate  in  real-time  [7].  Finally, 
closed-form  models  are  difficult  to  obtain  for  GPR  responses  from  non-canonical  or 
oddly-shaped  targets.  Even  if  numerically-simulated  or  laboratory-measured  proto¬ 
type  signals  can  be  obtained,  it  will  be  difficult  to  keep  up  with  the  threat  of  IEDs 
that  are  constantly  evolving  with  changes  in  countermeasures,  available  material, 
training  of  bomb-makers,  and  the  sophistication  of  production  facilities.  Alterna¬ 
tively,  statistical  techniques  may  be  a  more  robust  approach  to  accounting  for  these 
aspects of potential targets, as well as the ever-changing subsurface environment, in
subsurface threat detection algorithms.


Statistical  Approaches  to  Target  Detection  with  GPR 

The  category  of  statistical  techniques  for  target  detection  in  GPR  can  be  further  di¬ 
vided  into  two  sub-categories:  prescreeners  and  classifiers.  Prescreeners  are  compu¬ 
tationally  inexpensive  anomaly  detectors  that  must  detect  a  wide  variety  of  potential 
threats  and  adapt  to  changing  background  statistics.  Although  template  matching 
techniques  based  on  correlation  filters  can  perform  well  in  detecting  specific  target 
types  in  a  static  environment,  as  demonstrated  by  Brunzell  [35],  they  may  fail  when 
faced with a diverse target population and multiple environments. Instead, adaptive
filtering approaches have shown promise as prescreeners that model the GPR background
and detect anomalies that statistically differ from the background. Examples
include  linear  prediction  as  proposed  by  Ho  et  al.  [36]  and  Yoldemir  and  Sezgin  [37], 
least-mean-square  (LMS)  prediction  proposed  by  Torrione  et  al.  [38,39],  and  particle 
filters  proposed  by  Ng  et  al.  [40].  The  goal  of  prescreening  is  to  detect  all  of  the 
anomalies  present  in  the  data,  whether  they  are  associated  with  true  landmine  sig¬ 
natures  or  not.  The  leading  prescreeners  do  succeed  at  this,  but  also  mistake  many 
clutter  anomalies  for  potential  targets.  Therefore,  prescreeners  generally  perform  at 
a  high  PD,  but  at  the  expense  of  a  moderate  FAR. 
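
To make the idea of an adaptive prescreener concrete, the following sketch applies a normalized LMS one-step linear predictor to a one-dimensional sequence and reports the squared prediction error as an anomaly statistic. It is not the algorithm of [38,39]; the function name `lms_prescreen`, the filter order, the step size, and the restriction to a 1-D input are assumptions chosen for illustration only.

```python
def lms_prescreen(signal, order=4, step=0.5, eps=1e-6):
    """Return the squared one-step prediction error for each sample.

    A normalized-LMS filter continually adapts to the background, so samples
    that the adapted predictor cannot explain yield a large error.
    The first `order` entries are left at zero because no prediction exists.
    """
    weights = [0.0] * order
    errors = [0.0] * len(signal)
    for n in range(order, len(signal)):
        past = signal[n - order:n]
        prediction = sum(w * x for w, x in zip(weights, past))
        err = signal[n] - prediction
        errors[n] = err * err
        # Normalized LMS update: the step is scaled by the input energy.
        norm = eps + sum(x * x for x in past)
        weights = [w + step * err * x / norm for w, x in zip(weights, past)]
    return errors


# Hypothetical example: a flat background with one anomalous sample.
background = [1.0] * 100
background[60] = 6.0
anomaly_statistic = lms_prescreen(background)
```

Thresholding such a statistic would flag every sample that departs from the learned background, regardless of whether the departure is caused by a target or by clutter, which is consistent with the high-PD, moderate-FAR behavior described above.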

A  larger  body  of  research  has  been  focused  on  the  development  of  feature-based 
classifiers  based  on  statistical  pattern  recognition  and  machine  learning  theory.  After 
the  prescreener  finds  locations  in  the  raw  data  where  an  anomaly  is  present  (referred 
to  as  alarms),  features  are  extracted  to  provide  a  low-dimensional  representation 
of  the  GPR  data  collected  at  that  location.  Features  are  generally  physics-based 
and/or  morphological,  and  aim  to  be  invariant  with  respect  to  the  environment.  The 
classifier  then  applies  a  statistical  decision  rule  to  the  feature  space,  and  classifies  the 
anomalies as targets or clutter. The approach of prescreening followed by feature-based
classification has been shown to be effective in maintaining high PD while reducing
PF/FAR to levels appropriate for fielded systems [39,41-47].

Feature  extraction  approaches  are  generally  motivated  by  the  underlying  phe¬ 
nomenology  of  a  particular  sensor,  so  as  to  exploit  the  physical  characteristics  of 
target  responses.  In  GPR,  feature  extraction  is  used  to  characterize  the  hyperbolic 
shape  and  reverberation  properties  of  target  responses.  A  wide  variety  of  feature 
extraction  approaches  have  been  proposed  in  the  recent  literature,  including  edge- 
based  [42-44,48],  spectral  [49,50],  geometric  [45,46,51],  and  texture  [47]  features.  The 
decision  rules  are  learned  from  the  features  using  statistical  models.  These  include 
hidden Markov models [42,48], self-organizing maps and fuzzy k-nearest neighbors
(KNN) [43,44], relevance vector machines [47], and neural networks [45,46]. GPR
features have also been combined with features extracted from other sensor data,
such as EMI or seismic sensors [44,52-54], as a feature-level form of sensor fusion.

Until recently, the performance of leading feature-based landmine detection algorithms
was not compared with respect to environmental context. Wilson et al.
made a large-scale comparison between four leading classification algorithms on a
large GPR data set that was collected at four environmentally distinct test sites [41].
The following algorithms were compared: the hidden Markov model (HMM) algorithm
proposed  by  Gader  et  al.  [42,48],  the  edge  histogram  descriptor  (EHD)  algorithm 
proposed  by  Frigui  et  al.  [44,55],  the  algorithm  based  on  geometric  features  (GEOM) 
proposed  by  Gader  et  al.  [45],  and  the  spectral  correlation  feature  (SCF)  algorithm 
proposed  by  Ho  et  al.  [49]. 

Table  1.1  summarizes  the  results  of  the  experiment,  in  which  the  algorithms  were 
ranked  based  on  benchmark  PDs  and  FARs.  The  table  illustrates  that  although 
EHD  and  HMM  were  the  best-performing  algorithms  on  the  aggregate  of  all  sites, 
certain  algorithms  performed  better  than  others  on  specific  sites  and  for  specific 
performance metrics. In other words, the comparisons by Wilson et al. showed
that there is currently no “silver bullet” classifier for GPR-based landmine detection
across all environments. Furthermore, since the four algorithms exploit complementary
features of GPR signatures, it was suggested that algorithm fusion may provide
additional performance benefits. Experimental results illustrated that fusing the
confidences of each algorithm, weighted according to their relative performance in
each environment, could yield significant performance improvements.

Table 1.1: Performance of Landmine Detection Algorithms as Compared by Wilson
et al. [41].

Metric      PD=0.95   PD=0.90   PD=0.85     FAR=0   FAR=0.0007   FAR=0.00007
Site A      EHD       HMM       SCF         GEOM    SCF          GEOM
Site B      EHD       EHD       EHD         EHD     EHD          EHD
Site C      SCF       GEOM      GEOM/SCF    EHD     EHD          GEOM
Site D      EHD       HMM       HMM         EHD     EHD          EHD
All Sites   EHD       HMM       HMM         EHD     EHD          EHD

1.3  Context-Dependent  Learning 

The  impact  of  underlying  contextual  factors  on  how  observations  can  be  interpreted 
is not unique to landmine signatures in GPR data. Such effects, known as context-dependency,
have been investigated much earlier in the field of semantic memory [56].
Words have a virtually infinite number of properties (e.g., “hospital” is both a
“building” and “a place where food is served”). However, certain properties may be
emphasized by how the word appears in a certain semantic context (context-dependent
properties), while others are always evident (context-independent properties). Referring
to the hospital example, it is clearly evident that a hospital is a building,
making  it  a  context-independent  property.  However,  the  property  of  hospitals  being 
a  place  of  food  service  may  only  become  evident  in  a  discussion  with  patients  and 
their  dietitians,  therefore  making  it  a  context-dependent  property. 

In statistical learning, the manifestation of context-dependent properties may
come in the form of changes in the distribution of a class or variable of interest (i.e.,
the target concept) with respect to underlying contextual factors. This problem is
often  referred  to  as  concept  drift  [57,58].  For  learning  in  the  presence  of  concept 
drift,  it  may  be  beneficial  to  utilize  a  context-dependent  model.  Speech  recognition 
is a field that embraced this notion early on, where it was shown that context-dependent
phonetic models (i.e., modeling each phone as statistically dependent on the
phones immediately preceding and following it) yielded substantial improvements in
word recognition performance [59-61].

In  remote  sensing  applications,  it  can  be  useful  to  exploit  the  dependency  of  sensor 
phenomenology  on  ambient  environmental  factors.  Although  the  contextual  factors 
being  exploited  are  often  sensor-specific,  the  common  thread  is  that  local  similarities 
in  sensor  data  can  be  exploited  to  improve  overall  robustness  in  detection  perfor¬ 
mance.  For  example,  in  airborne  remote  sensing  imagery,  segmentation  algorithms 
aim  to  find  several  locally  homogeneous  regions  in  a  macroscopically  heterogeneous 
image.  These  areas  could  correspond  to  buildings,  different  types  of  planted  crops 
and  vegetation  cover,  roads,  or  areas  affected  by  natural  disasters.  While  all  pixels 
covering  these  types  of  areas  should  appear  similar  at  a  macroscopic  scale,  pixel- 
based  segmentation  generally  leads  to  significant  misclassification  error  within  these 
regions.  Incorporating  spatial  context  has  therefore  been  proposed  for  “smoothing 
out”  these  errors  to  yield  more  homogeneous  segmentation  regions  [62,63]. 

Anomaly  detection  is  another  problem  that  can  benefit  from  a  context-dependent 
learning  approach,  since  ambient  conditions  can  significantly  affect  the  statistical  dis¬ 
tributions  of  anomalous  sensor  data,  features  extracted  from  such  anomalies,  or  the 
confidence  values  of  anomaly  detection  algorithms  that  exploit  complementary  infor¬ 
mation.  In  any  of  these  spaces,  similar  observations  can  potentially  be  clustered  into 
discrete  contexts  that  are  representative  of  unique  environmental  conditions.  This 
process is referred to as context identification. After context identification is performed,
context-specific models can be trained for performing anomaly classification
within  each  context.  The  decision  rule  for  each  of  these  classifiers  may  be  unique  for 
each  of  the  contexts  that  were  learned. 

Several  techniques  have  been  proposed  for  context-dependent  learning  to  assist 
with  landmine  detection  in  GPR  data  as  well  as  in  hyperspectral  imagery  (HSI).  One 
method,  known  as  Context  Extraction  for  Local  Fusion  (CELF)  [64],  was  proposed 
by Frigui et al. for multi-sensor fusion (e.g., GPR/EMI) in autonomous landmine
detection  systems.  The  CELF  algorithm  is  motivated  by  the  assumption  that  differ¬ 
ent  subsets  of  the  threat  population  will  respond  differently  to  different  sensors.  For 
example,  shallow  AP  landmines  are  more  easily  detected  with  an  EMI  sensor  than 
with  GPR,  because  their  GPR  signature  is  often  lost  in  the  ground  bounce.  There¬ 
fore,  the  EMI  sensor  should  be  relied  upon  more  heavily  when  those  types  of  targets 
are  encountered.  Conversely,  low-metal  AT  landmines  are  more  easily  detected  by 
GPR  than  EMI,  so  GPR  should  be  relied  upon  more  heavily  for  these  targets. 

In  CELF,  a  fuzzy  clustering  scheme  was  proposed  for  grouping  together  observa¬ 
tions  with  similar  responses  from  each  sensor,  and  these  clusters  describe  the  under¬ 
lying  contexts.  Learning  the  contexts  is  performed  discriminatively  by  optimizing  an 
objective  function  that  accounts  for  both  cluster  size  as  well  as  discriminability  of  ob¬ 
servations  in  each  cluster  by  a  linear  decision  rule.  Experimental  results  showed  that 
CELF  was  able  to  partition  large  data  sets  into  observations  with  similar  GPR/EMI 
responses,  and  it  achieved  better  classification  performance  than  either  individual 
sensor  as  well  as  a  conventional  linear  fusion  incorporating  no  contextual  informa¬ 
tion.  It  was  also  shown  that  CELF  can  be  applied  to  fusion  of  multiple  classification 
algorithms  for  the  same  sensor  type  [55].  For  example,  the  four  GPR  algorithms 
that  were  originally  compared  by  Wilson  et  al.  [41]  can  be  fused  differently  based 
on  the  underlying  context,  yielding  significant  improvements  in  performance  over 
conventional algorithm fusion.

Another context-dependent classification technique, originally proposed for HSI,
is  the  random  set  framework  (RSF)  proposed  by  Bolton  and  Gader  [65].  The  RSF 
treats observation populations, rather than individual observations, as random sets.
The  random  sets  of  spectra  that  constitute  the  individual  contexts  were  represented 
by  a  germ-and-grain  model  [66],  which  allowed  for  tractable  modeling  of  irregular 
orientations  of  the  observation  space.  A  unique  GMM  classifier  (based  on  the  like¬ 
lihood  ratio  test)  was  then  trained  via  maximum-likelihood  for  each  of  the  learned 
contexts. 

The  RSF  differs  significantly  from  CELF  in  how  training  is  performed;  the  context 
model  is  trained  in  a  supervised  manner,  with  each  context  corresponding  to  the 
distinct  environmental  conditions  in  which  data  was  collected,  and  the  classifiers  are 
learned  independently  from  the  context  model.  The  germ-and-grain  model  is  learned 
by  minimizing  the  misclassification  error  between  contexts,  and  the  classifiers  are 
trained  by  expectation-maximization  of  the  GMM  parameters  for  each  class  using 
the  observations  found  in  each  context.  Experimental  results  illustrated  that  the 
RSF  achieved  better  classification  performance  than  GMM  classifiers  incorporating 
no  contextual  information,  including  several  baseline  algorithms  from  the  literature. 

Both  CELF  and  RSF  have  illustrated  the  potential  that  context-dependent  learn¬ 
ing  has  in  improving  overall  performance.  However,  the  approach  on  which  these 
techniques are based can be improved upon further. First, in both CELF and RSF,
contextual factors are learned from similarities and differences in target responses.
However, it may be desirable in some applications to be able to infer the context
from the background, since it can be generally assumed that most data collected in the
field  will  be  target-free.  For  example,  vehicular  route-clearance  systems  that  may 
travel  and  collect  data  for  many  kilometers  may  be  able  to  obtain  valuable  contextual 
information  from  the  background  before  encountering  a  target. 

Furthermore, both CELF and RSF require specification of the number of contexts
to be learned a priori. This caveat could be especially problematic in situations
where the number of contexts that can potentially be encountered is unknown.
Because  each  of  these  approaches  essentially  uses  a  mixture  model  to  partition  a 
high-dimensional  data  set  into  discrete  contexts,  the  context  model  can  easily  be 
overtrained  by  specifying  too  many  contexts,  or  undertrained  by  specifying  too  few. 
It may be more desirable to use a model that facilitates learning of the number of
contexts that best explain the training data, while also facilitating the learning of new
contexts as field data becomes available.

1.4  Novel  Contributions 

In  contrast  to  the  past  literature,  this  dissertation  is  based  on  a  different  interpreta¬ 
tion  of  context  for  anomaly  detection  applications.  While  past  techniques  by  Frigui 
et  al.  and  Bolton  et  al.  have  focused  on  context  being  a  property  of  individual 
sensor  observations,  the  algorithms  developed  in  this  work  interpret  context  as  the 
state  of  the  world  at  a  given  location  in  space  and  time.  Contextual  information  was 
extracted  from  raw  background  data  through  a  set  of  physically-motivated  features, 
which  were  developed  for  characterizing  various  environmental  properties.  Using 
these  features,  a  variety  of  nonparametric  context  models  were  trained  via  Bayesian 
methods  to  learn  a  distinct  number  of  contexts.  Then,  unique  algorithm  fusion 
weights  were  learned  for  each  of  the  contexts.  The  overall  classification  performance 
of  context-dependent  fusion  was  compared  to  the  leading  target  detection  algorithms 
from  the  literature,  as  well  as  conventional  algorithm  fusion  approaches. 

A flowchart outlining the general procedure for context-dependent learning, as
proposed in this dissertation, is shown in Figure 1.7. Given a set of observations $\mathbf{x}$,
the underlying context of each observation is first identified probabilistically from the
contextual features $\mathbf{x}_n^{(C)}$. The resulting context posteriors, $p(c_n = m \mid \mathbf{x}_n^{(C)})$, indicate
the probability that $\mathbf{x}_n$ was observed under context $m$, for $m = 1, 2, \ldots, M$. After
partitioning the training data into $M$ contexts, an ensemble of $M$ binary classifiers is
trained on the target features $\mathbf{x}_n^{(T)}$ of observations from each context. The resulting
within-context target posteriors, $p(H_1 \mid \mathbf{x}_n^{(T)}, c_n = m)$, represent the probabilities that
$\mathbf{x}_n$ belongs to the $H_1$ class, given that it was observed under context $c_n = m$. Finally,
target posteriors $p(H_1 \mid \mathbf{x}_n)$ are calculated by integrating over uncertainty in context:

$$p(H_1 \mid \mathbf{x}_n) = \sum_{m=1}^{M} p(H_1 \mid \mathbf{x}_n^{(T)}, c_n = m)\, p(c_n = m \mid \mathbf{x}_n^{(C)}) \qquad (1.1)$$

FIGURE 1.7: Flowchart illustrating a basic context-dependent classification technique.
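
The marginalization over context in Equation (1.1) can be sketched for a single observation as follows. The function name `fused_target_posterior` and the example probabilities are hypothetical; the sketch assumes that the context posteriors and the within-context target posteriors have already been produced by some context model and some set of per-context classifiers.

```python
def fused_target_posterior(context_posteriors, within_context_posteriors):
    """Evaluate Eq. (1.1) for one observation.

    context_posteriors[m]        : p(c_n = m | x_n^(C)), must sum to one
    within_context_posteriors[m] : p(H_1 | x_n^(T), c_n = m)
    """
    if len(context_posteriors) != len(within_context_posteriors):
        raise ValueError("one target posterior is required per context")
    if abs(sum(context_posteriors) - 1.0) > 1e-9:
        raise ValueError("context posteriors must sum to one")
    # Weighted sum of within-context target posteriors over all contexts.
    return sum(p_target * p_context
               for p_target, p_context
               in zip(within_context_posteriors, context_posteriors))


# Hypothetical example with M = 3 contexts.
p_context = [0.7, 0.2, 0.1]
p_target_given_context = [0.9, 0.4, 0.1]
p_target = fused_target_posterior(p_context, p_target_given_context)
```

When one context posterior is close to one, the result reduces to the output of that context's classifier; when the context is ambiguous, the classifiers are blended in proportion to their context probabilities.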

Context  learning  was  performed  using  features,  motivated  by  GPR  phenomenology, 
that  provide  a  low- dimensional  characterization  of  local  environmental  conditions. 
These features were considered separately from the features used to distinguish targets
from non-targets, thereby facilitating learning of the context model’s parameters
independently. Experiments were performed with real and simulated sensor data to
illustrate  that  the  context  features  are  indicative  of  quantitative  environmental  prop¬ 
erties  which  represent  contextually-relevant  factors  in  subsurface  sensing. 

The  context  models  proposed  in  this  dissertation  are  based  on  nonparametric 
Bayesian  inference.  The  statistics  literature  has  proposed  several  nonparametric 
Bayesian  techniques  that  are  useful  in  learning  models  of  uncertain  order,  and  these 
models  facilitate  an  approach  to  learning  the  effective  model  order.  In  context  learn¬ 
ing,  this  amounts  to  learning  not  only  the  parameters  that  characterize  each  context’s 
distribution  in  feature  space,  but  also  the  number  of  contexts  present  in  the  training 
data. 

Several  distinct  context  models  are  proposed  in  this  dissertation.  Although  they 
all  are  essentially  mixture  densities  that  will  partition  the  data  into  M  components, 
they  differ  in  the  information  used  to  partition  the  data.  First,  approaches  that  as¬ 
sume  independence  of  observations  are  proposed.  These  include  a  Gaussian  mixture 
model  and  a  mixture  of  factor  analysis  models,  each  incorporating  a  Dirichlet  process 
prior  to  facilitate  learning  of  the  number  of  contexts  [67,68].  A  context  model  that 
incorporates  spatial  information  is  also  presented,  and  is  based  upon  an  HMM  with 
a  Dirichlet  process  prior  to  facilitate  learning  of  the  number  of  states  [69].  Compar¬ 
isons  are  made  between  the  different  types  of  context  models,  and  the  advantages  and 
disadvantages  of  using  each  are  discussed.  Furthermore,  the  merits  of  incorporating 
spatial  information  are  also  highlighted. 
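
One way to build intuition for how a Dirichlet process prior leaves the number of mixture components open is to sample cluster assignments from its Chinese restaurant process representation. The sketch below draws from the prior only and performs no inference, so it is not the learning procedure used for the context models in this dissertation; the function name `sample_crp` and the concentration parameter value are assumptions for illustration.

```python
import random


def sample_crp(n_obs, alpha, seed=0):
    """Draw context labels for n_obs observations from a Chinese restaurant process.

    Observation n (zero-based) joins existing context k with probability
    counts[k] / (n + alpha) and opens a new context with probability
    alpha / (n + alpha), so the number of contexts is not fixed in advance.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of observations assigned to context k
    labels = []
    for n in range(n_obs):
        r = rng.uniform(0.0, n + alpha)
        cumulative = 0.0
        chosen = len(counts)          # default outcome: open a new context
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels, counts


# Hypothetical example: 500 observations with concentration alpha = 1.
labels, counts = sample_crp(500, alpha=1.0, seed=0)
num_contexts = len(counts)
```

Larger values of `alpha` tend to produce more contexts for the same number of observations, and new contexts can still appear as further observations arrive, which is the property that motivates such priors when the number of contexts is unknown.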

Two  general  techniques  for  learning  the  proposed  context  models  are  used.  First, 
several  generative  context  learning  approaches  are  presented  that  consider  the  train¬ 
ing  of  the  context  model  as  an  independent  task  from  training  the  binary  target 
classifiers.  A  generative  approach  will  learn  the  model  that  best  explains  the  train¬ 
ing  data  by  maximizing  the  posterior  probability  of  the  model  parameters.  Such  a 
learning approach can be useful in scenarios in which contextually-diverse training
data is available, and can potentially avoid overtraining to the given target/clutter
population.  The  other  approach  is  discriminative  context  learning,  which  will  learn 
contexts  that  allow  for  the  best  discrimination  of  targets  from  non-targets.  This 
is  achieved  by  maximizing  the  posterior  probability  of  the  class  labels,  given  the 
training  data  and  the  context  model  parameters. 

Experimental  results  are  presented  for  using  context-dependent  learning  as  a 
means  for  improving  decision  fusion  of  several  detection  algorithms  used  in  fielded 
GPR  systems.  Performance  is  compared  to  the  individual  algorithms  as  well  as 
to  global  fusion.  In  addition,  results  are  presented  illustrating  the  performance 
of  context-dependent  learning  for  improving  anomaly  classification  in  HSI.  In  both 
types  of  problems,  context-dependent  learning  is  shown  to  achieve  higher  PD  and 
lower  FAR  than  conventional  machine  learning  approaches,  emphasizing  that  valu¬ 
able  contextual  information  can  be  exploited  from  the  background  data  to  improve 
sensing robustness in the presence of changing environmental conditions.

2


Extracting  Contextual  Information  from  GPR  Data 


One  of  the  goals  of  this  work  is  to  statistically  model  the  distinct  contexts  present 
in  large  bodies  of  GPR  data  collected  over  varying  environmental  conditions.  How¬ 
ever,  the  dimensionality  of  raw  sensor  data  can  be  very  high.  For  example,  data 
collected  with  the  NIITEK  GPR  has  a  temporal  resolution  of  512  samples  and  a  5 
cm  spatial  sampling  rate,  and  a  B-scan  covering  the  entire  signature  of  a  target  could 
be  as  many  as  25  downtrack  samples  long.  Therefore,  vectorizing  the  B-scan  would 
result  in  a  64,000-dimensional  observation.  Furthermore,  many  dimensions  (i.e.,  pix¬ 
els)  of  raw  data  could  be  highly  correlated  (e.g.,  neighboring  pixels),  while  others 
could  be  non-informative  (e.g.,  pixels  above  the  ground  bounce).  High-dimensional 
data  is  very  difficult  to  model  statistically  due  to  the  oft-cited  curse  of  dimension¬ 
ality  [70-72],  which  suggests  that  the  number  of  required  training  samples  increases 
exponentially  with  the  number  of  dimensions.  Therefore,  in  order  to  effectively  model 
the  distribution  of  various  contextual  factors  in  GPR  data,  it  may  be  desirable  to 
utilize  low-dimensional  features  that  characterize  such  factors  and  are  amenable  to 
clustering. 

In this chapter, physics-based techniques for extracting contextual features from
raw GPR data are described¹. GPR phenomenology suggests that a single B-scan
may  contain  an  abundance  of  information  about  various  contextual  factors,  including 
surface  roughness,  soil  electromagnetic  properties,  the  presence  of  multiple  layers, 
and the subsurface heterogeneity. Direct estimation of these subsurface environmental
properties may be achieved via inverse numerical modeling or deconvolution
[27,29-34]. However, the computational complexity of inversion and deconvolution
makes real-time implementation of these approaches infeasible.

This  chapter  proposes  an  alternative  technique  for  extracting  contextual  infor¬ 
mation  from  GPR  background  data  using  several  features  that  were  developed  based 
upon  a  transmission  line  model  [10,39].  Statistical  classification  and  regression  mod¬ 
els  were  trained  on  the  features  to  predict  multiple  environmental  properties  from 
real  and  simulated  GPR  data.  Experimental  results  illustrate  that  the  proposed 
features  are  indicative  of  several  quantitative  factors  that  can  be  used  to  facilitate 
context  learning  in  buried  threat  detection  applications. 

2.1  Transmission  Line  Model  for  GPR 

A  simple  phenomenological  model  for  GPR  A-scans  can  be  motivated  by  electrical 
transmission  lines  [10,39].  In  a  similar  manner  to  a  signal  transmitted  down  a  trans¬ 
mission  line  with  several  impedance  mismatches,  a  GPR  signal  consists  of  several 
reflections  of  the  transmitted  pulse  at  various  amplitudes  and  delays.  According 
to  this  model,  each  received  pulse  therefore  corresponds  to  a  subsurface  interface. 
Figure  2.1  provides  a  basic  illustration  of  the  transmission  line  model  as  an  approxi¬ 
mation  of  a  heterogeneous  soil  environment.  Note  the  similarity  of  the  signal  derived 
from  such  a  model  to  a  typical  GPR  A-scan. 
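
A minimal sketch of how an A-scan could be synthesized under such a model is given below. It assumes normal incidence on lossless, non-magnetic layers (so that impedance scales as one over the square root of relative permittivity), ignores multiple reflections, and uses a Ricker wavelet as the transmitted pulse. The function names, layer values, and pulse shape are illustrative assumptions rather than the formulation of [10,39].

```python
import math

C0 = 0.299792458  # free-space speed of light in metres per nanosecond


def ricker(t_ns, f0_ghz):
    """Ricker wavelet centred at t = 0 (assumed pulse shape)."""
    arg = (math.pi * f0_ghz * t_ns) ** 2
    return (1.0 - 2.0 * arg) * math.exp(-arg)


def interface_echoes(rel_permittivities, thicknesses):
    """Return (two-way delay in ns, amplitude) for each layer interface.

    rel_permittivities : one value per layer, the last being a half-space
    thicknesses        : thickness in metres of every layer except the last
    """
    if len(rel_permittivities) != len(thicknesses) + 1:
        raise ValueError("need one more permittivity than thicknesses")
    echoes = []
    delay = 0.0
    two_way_gain = 1.0  # accumulated transmission through earlier interfaces
    for k, thickness in enumerate(thicknesses):
        n_in = math.sqrt(rel_permittivities[k])
        n_out = math.sqrt(rel_permittivities[k + 1])
        delay += 2.0 * thickness * n_in / C0
        gamma = (n_in - n_out) / (n_in + n_out)  # reflection coefficient
        echoes.append((delay, two_way_gain * gamma))
        two_way_gain *= 1.0 - gamma ** 2
    return echoes


def synth_ascan(echoes, n_samples, dt_ns, f0_ghz):
    """Sum delayed, scaled copies of the pulse, one per interface."""
    return [sum(a * ricker(i * dt_ns - d, f0_ghz) for d, a in echoes)
            for i in range(n_samples)]


# Hypothetical example: antenna 0.3 m above soil (eps_r = 4) that
# overlies a second layer (eps_r = 9) at a depth of 0.15 m.
echoes = interface_echoes([1.0, 4.0, 9.0], [0.3, 0.15])
ascan = synth_ascan(echoes, n_samples=256, dt_ns=0.05, f0_ghz=1.0)
```

Each entry of `echoes` corresponds to one impedance mismatch, so the synthesized trace is a train of pulses whose delays encode layer thickness and permittivity and whose amplitudes encode dielectric contrast.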

¹ This chapter is derivative of previously published work, © 2012 IEEE. Reprinted, with permission,
from Ratto et al., “Characterization of the subsurface environment with GPR using feature-based
statistical learning,” IEEE Transactions on Geoscience and Remote Sensing, in review as of
Feb. 2012.

FIGURE 2.1: A diagram of the transmission line model for GPR A-scans [39]. Left:
an example of unique dielectric layers in the subsurface. Center: the corresponding
transmission line with three characteristic impedances. Right: An A-scan generated
under this model.


Several  broad  assumptions  are  made  by  modeling  GPR  A-scans  as  the  signal 
received from a mismatched transmission line. Multipath effects are ignored, propagating
waves are assumed to be planar, all interfaces are assumed planar and infinite
in extent, and the respective transmission media are assumed to be homogeneous,
lossless, and non-dispersive. However, any deviations of real signals from the
model  assumptions  may  be  accounted  for  by  a  statistical  model.  In  the  remainder 
of  this  chapter,  the  features  derived  from  the  transmission  line  model  are  described, 
and  experimental  results  illustrate  that  these  features  are  indicative  of  quantitative 
environmental  properties  via  statistical  inference. 

2.2  GPR  Contextual  Features 

A variety of features are proposed in this chapter for extracting contextual information
from GPR B-scans. The following notation is used in describing the features:
columns of the B-scan are A-scans denoted as a(t), where t is a temporal sample index
(t = 1, 2, ..., T); rows are the time-slices denoted as b(n), where n is a spatial sample
index (n = 1, 2, ..., N). For each feature that is described in this section, sample
feature vectors extracted from simulated B-scans are shown. The simulated B-scans
were generated using the publicly-available finite-difference time-domain (FDTD)
modeling software GprMax [73,74].

Energy  features 

The  total  energy  of  an  A-scan  is  a  basic  feature  that  is  calculated  by  summing  the 
time  samples  of  a  squared  A-scan: 

$$ e = \sum_{t=1}^{T} a^2(t) \tag{2.1} $$

The energy feature provides information regarding several properties of the subsurface
environment. In scenarios where the GPR antenna is close to the ground, there
is high dielectric contrast between the air and the ground, or the soil is very heterogeneous,
the energy feature should have a high value. Conversely, in scenarios in which
the GPR antenna is high above the ground, the soil has little dielectric contrast with
the air, or the subsurface is largely free of inhomogeneities, the energy feature should
yield a low value.
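
The energy feature of Eq. 2.1 can be sketched in a few lines. This is a minimal illustration, assuming the B-scan is stored as a NumPy array with time samples along rows and A-scans along columns; the function name `energy_feature` and the toy data are hypothetical and not part of the original processing code.

```python
import numpy as np

def energy_feature(bscan):
    """Per-A-scan energy (Eq. 2.1): sum of squared time samples in each column."""
    bscan = np.asarray(bscan, dtype=float)
    return np.sum(bscan ** 2, axis=0)

# Toy B-scan (assumed layout): T = 4 time samples (rows) by N = 2 A-scans (columns).
toy_bscan = np.array([[ 1.0, 0.0],
                      [ 2.0, 0.5],
                      [-1.0, 0.0],
                      [ 0.0, 0.5]])
energies = energy_feature(toy_bscan)  # one energy value per A-scan
```

The first toy column, which has the larger amplitudes, yields the larger energy, in line with the qualitative behavior described above.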

Reflection  coefficient  features 

In  transmission  lines,  the  degree  of  impedance  mismatch  is  often  expressed  in  terms 
of  reflection  coefficients,  i.e.  the  ratio  of  reflected  to  transmitted  power.  In  GPR, 
the  reflection  coefficient  at  the  air/ground  interface  is  of  particular  interest;  because 
the  dielectric  properties  of  air  are  usually  assumed  to  be  equal  to  those  of  free  space, 
the  air/ground  reflection  coefficient  may  characterize  subsurface  dielectric  properties. 
Accurate  estimation  of  the  air/ground  reflection  coefficient  must  take  into  account 
propagation  losses,  rather  than  simply  compare  the  ground  bounce  magnitude  to 
the  transmitted  power,  or  else  estimates  may  be  inaccurate  [75].  The  free-space  loss 
($L_{FS}$) of a line-of-sight path follows the power law given by

$$ L_{FS} = \left( \frac{4 \pi d}{\lambda} \right)^2 = \left( \frac{4 \pi d f}{c} \right)^2 \tag{2.2} $$


where distance is denoted by $d$, $\lambda$ is the signal’s wavelength, and $c$ is the free-space
propagation speed. For a given distance, transmit and receive antennas with respective
gains $G_t$ and $G_r$, a reflector with cross-sectional area $A$, and transmitted
power $P_t$, the received power $P_r$ can be expressed as a function of the reflection
coefficient $\Gamma$:


$$ P_r = \frac{P_t G_t G_r A}{4 \pi d^2 L_{FS}} \Gamma^2 = \frac{P_t G_t G_r A c^2}{(4\pi)^3 f^2 d^4} \Gamma^2 \tag{2.3} $$

Solving for $\Gamma$, and consolidating $P_t$, $G_t$, $G_r$, $A$, $f$, and $c$ into a single constant
that characterizes the radar system, the reflection coefficient can be expressed as
proportional to a function of distance and received power:

$$ \Gamma \propto d^2 \sqrt{P_r}. \tag{2.4} $$


In GPR data, the reflection coefficient can be approximated by applying basic
radar ranging to the approximate ground bounce. First, $d$ must be calculated by
dividing the ground bounce arrival time by the system’s range resolution $S$ (expressed
in samples/m):

$$ d = t_{GB} / S \tag{2.5} $$

where it is assumed that

$$ t_{GB} = \arg\max_{t} a(t) \tag{2.6} $$

The received power is calculated by windowing out the ground bounce from an A-scan
using the Gaussian function $w(t)$. The Gaussian window is centered on the midpoint
between the A-scan’s global maximum and minimum, and its width is specified by
the constant $\sigma_w$:

$$ P_r = \sum_{t=1}^{T} w(t) a^2(t) \tag{2.7} $$

$$ w(t) = \frac{1}{\sqrt{2 \pi \sigma_w^2}} \exp\left( -\frac{(t - \mu_w)^2}{2 \sigma_w^2} \right) \tag{2.8} $$

$$ \mu_w = t_{GB} + (t_{min} - t_{GB})/2 \tag{2.9} $$

$$ \sigma_w = \text{const.} \tag{2.10} $$

$$ t_{min} = \arg\min_{t} a(t) \tag{2.11} $$

Figure 2.2 illustrates a comparison of the energy and reflection coefficient features
extracted from simulated B-scans generated over simulated soils with different
dielectric properties. The top panel illustrates a soil characterized by a low dielectric
constant ($\epsilon_r = 3$), and the bottom panel illustrates a soil characterized by a high
dielectric constant ($\epsilon_r = 10$); the electrical conductivities of both soils were equal.
The plots of the feature values illustrate that the energy and reflection coefficient
values are higher for the soil with high dielectric constant. Note that the values of
the reflection coefficient feature do not reflect valid reflection coefficient values (i.e.,
between 0 and 1) because the scaling constants in (2.3) are ignored.


Matching  pursuits  features 


Figure 2.2: Examples of energy and reflection coefficient features for FDTD-simulated
B-scans. Top: soil with dielectric constant of 3. Bottom: soil with dielectric
constant of 10. The simulated B-scans are shown at left, and plots of the
energy and reflection coefficient feature values are shown at right. © 2012 IEEE.

GPR data can appear very cluttered when collected over heterogeneous soils due to
reflections from multiple subsurface interfaces, and it may be useful to determine
when heterogeneous soils are encountered. One technique for measuring soil heterogeneity
based on the transmission line model is to determine how many unique pulses
can be used to replicate an A-scan. This is based on the hypothesis that A-scans
collected over heterogeneous soils would consist of more pulses than A-scans collected
over homogeneous soils. In this work, the matching pursuits (MP) algorithm
(proposed by Mallat and Zhang [76]) is used to approximate an A-scan as a sum of
unique pulses, which are selected from a dictionary $D = \{\mathbf{d}(\omega, t_0)\}$ of differentiated
Gaussian elements with varying widths $\omega$ and temporal positions $t_0$:

$$ \mathbf{d}(\omega, t_0) = \exp\left( -\frac{(t - t_0)^2}{2 \omega^2} \right), \quad t_0 = 1, 2, \ldots, T \tag{2.12} $$

The basic MP algorithm first correlates each dictionary element with the original
signal, $\mathbf{a} = \{a(t)\}$, then subtracts from the signal the most-correlated element
weighted by its correlation. The process is then repeated using the residual signals,
and continues until the change in energy falls below a specified threshold ($\delta_0$).
Algorithm 1 summarizes the application of MP to GPR A-scans.

From the set of selected dictionary elements, features can be extracted to characterize
subsurface heterogeneity. One feature is the number of iterations for MP to
converge ($n_{MP}$), which is analogous to the number of unique pulses that make up
an A-scan. From the transmission line model, each pulse corresponds to a unique
reflection at a dielectric interface. Therefore, the number of MP iterations may characterize
the amount of subsurface heterogeneity. Soils with low levels of heterogeneity
should yield lower values of $n_{MP}$ than soils with high levels of heterogeneity.


Algorithm 1 Basic matching pursuits [76]

input a, d, δ_0, ω
n ← 0
r ← a
E_0 ← ||r||^2
δE ← ∞
while δE > δ_0 do
    n ← n + 1
    for t_0 = 1, 2, ..., T do
        ρ(ω, t_0) ← d(ω, t_0)^T r / ||d(ω, t_0)||^2
    end for
    t'_{0n} ← argmax over t_0 of ρ(ω, t_0)
    r ← r − ρ(ω, t'_{0n}) d(ω, t'_{0n})
    E_n ← ||r||^2
    δE ← E_{n−1} − E_n
end while
n_MP ← n
â ← a − r
return n_MP, â, t'_0
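
A compact NumPy sketch of the basic matching pursuits loop is given below for illustration. It rests on several assumptions that are not taken from the original implementation: the dictionary holds plain (not differentiated) Gaussian atoms of a single width, atoms are scaled to unit norm, the best atom is chosen by the magnitude of its correlation, positions are 0-based, and a `max_iter` guard is added. The names `gaussian_dictionary` and `matching_pursuit` are hypothetical.

```python
import numpy as np

def gaussian_dictionary(T, omega):
    """T x T dictionary; column j is a unit-norm Gaussian of width omega centered at sample j."""
    t = np.arange(T)
    D = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * omega ** 2))
    return D / np.linalg.norm(D, axis=0)

def matching_pursuit(a, D, delta_0, max_iter=1000):
    """Return (number of iterations, approximation of a, selected atom positions)."""
    a = np.asarray(a, dtype=float)
    r = a.copy()                              # residual starts as the signal itself
    energy = r @ r
    positions = []
    for _ in range(max_iter):
        rho = D.T @ r                         # correlation of each atom with the residual
        j = int(np.argmax(np.abs(rho)))       # most-correlated atom
        r = r - rho[j] * D[:, j]              # subtract its weighted contribution
        new_energy = r @ r
        delta_e = energy - new_energy         # change in residual energy
        energy = new_energy
        positions.append(j)
        if delta_e < delta_0:                 # converged: energy change below threshold
            break
    return len(positions), a - r, positions

# Toy A-scan (assumed): two well-separated pulses.
D = gaussian_dictionary(100, omega=2.0)
toy_signal = 3.0 * D[:, 20] - 2.0 * D[:, 70]
n_mp, approx, selected = matching_pursuit(toy_signal, D, delta_0=0.01)
```

For the toy signal, the first two selected positions are the two pulse centers, and the iteration that finally falls below the threshold is also counted in `n_mp`, as in Algorithm 1.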


Another feature derived from MP is the temporal histogram of the selected dictionary
elements, denoted by $\mathbf{h}_{MP}$. The histogram bins correspond to the temporal
centers $t'_0$ and are experimentally determined. The goal of using the MP histogram
is to differentiate between soils of varying heterogeneity, using the hypothesis that
as heterogeneity increases so will the number of late-time reflections. The number
of late-time reflections would be reflected in the late-time values of the histogram.

Figure 2.3 illustrates the MP features for two B-scans. The top plots illustrate a
simulated B-scan over soil with low heterogeneity and the corresponding MP histogram,
while the bottom plots correspond to a highly heterogeneous soil. Note the differences
in the number of MP iterations and late-time histogram bins. The more
heterogeneous soil yields a higher number of total reflections, with a greater proportion
of them occurring late in time. Therefore, MP places more dictionary elements
in the later portion of the signal. Since the ground bounce is the portion of the
A-scan with the highest local energy, the majority of dictionary elements would otherwise be
selected to describe that portion of the signal. To prevent this from occurring, and
to obtain more information regarding subsurface reflections, MP was only run on
A-scan samples that occur after $t_{GB}$.

FIGURE 2.3: Example of MP histogram extracted from B-scans over soils. Top-left:
Simulated B-scan of low-heterogeneity soil; Top-right: number of MP iterations
until convergence and MP histogram for low-heterogeneity soil; Bottom-left: Simulated
B-scan of high-heterogeneity soil; Bottom-right: number of MP iterations until
convergence and MP histogram for high-heterogeneity soil. © 2012 IEEE.

Linear  prediction  features 

Linear prediction (LP) filters provide causal estimates of the power spectrum of a
signal [77]. In GPR applications, LP filters have been found to be particularly useful
as anomaly detectors [36-40]. Linear predictors can be applied to time-slices (rows)
of GPR data, so that anomalies could be characterized by high prediction error. This
result is because the LP filter is based on assumptions similar to the transmission
line model: planar interfaces infinite in extent should yield the same response for
all locations, and deviations from that planar assumption are considered as random
noise. Therefore, the behavior of LP filters may characterize “steady-state” environmental
properties such as soil dielectric constant [78], as well as stochastic properties
such as surface roughness and heterogeneity [79].

The equation for an LP filter takes the form of an autoregressive model of order
$K$, characterized by weights $\boldsymbol{\alpha} = [\alpha(1), \alpha(2), \ldots, \alpha(K)]^T$, which are applied to B-scan
time-slices $\mathbf{b}(n) = [b(n), b(n-1), \ldots, b(n-K)]$. The filter outputs a zero-mean, white
noise process $e(n)$ with variance $\nu$:


$$ b(n) - \sum_{k=1}^{K} \alpha(k) b(n-k) = e(n). \tag{2.13} $$

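The weights can be determined from the normal equations,

$$ \boldsymbol{\alpha} = \mathbf{R}^{-1} \mathbf{p}, \tag{2.14} $$

where the correlation matrix ($\mathbf{R}$) and cross-correlation vector ($\mathbf{p}$)
are given by

$$ \mathbf{R} = E\left[ \mathbf{b}(n-1) \mathbf{b}^H(n-1) \right] \tag{2.15} $$

$$ \mathbf{p} = E\left[ \mathbf{b}(n-1) b(n) \right]. \tag{2.16} $$

Given the weights, the prediction-error power ($\nu$) can be found by calculating the
mean-square error of the filter applied to the data:

$$ \nu = E\left[ |b(n) - \boldsymbol{\alpha}^T \mathbf{b}(n-1)|^2 \right] = \sigma^2 - 2 \boldsymbol{\alpha}^T \mathbf{p} + \boldsymbol{\alpha}^T \mathbf{R} \boldsymbol{\alpha} \tag{2.17} $$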
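
The normal equations and prediction-error power can be estimated from a single time-slice as sketched below. This is only an illustration under stated assumptions: the data are real-valued, expectations are replaced by sample averages over the row, and the function name `lp_prediction_error_power` and the synthetic first-order test signal are hypothetical.

```python
import numpy as np

def lp_prediction_error_power(b, K):
    """Fit an order-K linear predictor to one time-slice; return (weights, error power)."""
    b = np.asarray(b, dtype=float)
    # Row i of B holds the K past samples [b(n-1), ..., b(n-K)] used to predict b(n).
    B = np.column_stack([b[K - k: b.size - k] for k in range(1, K + 1)])
    target = b[K:]
    R = B.T @ B / target.size                 # sample estimate of Eq. 2.15
    p = B.T @ target / target.size            # sample estimate of Eq. 2.16
    alpha = np.linalg.solve(R, p)             # normal equations (Eq. 2.14)
    nu = np.mean((target - B @ alpha) ** 2)   # prediction-error power (Eq. 2.17)
    return alpha, nu

# Synthetic time-slice (assumed): first-order autoregressive process, unit-variance noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(20000)
toy_row = np.zeros(noise.size)
for n in range(1, noise.size):
    toy_row[n] = 0.9 * toy_row[n - 1] + noise[n]
alpha_hat, nu_hat = lp_prediction_error_power(toy_row, K=1)
```

For this synthetic row the estimated weight is close to 0.9 and the prediction-error power is close to the driving-noise variance, which is the behavior the feature relies on: predictable rows yield low error power.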

To extract the LP features from GPR data, a B-scan is first aligned according to
each column’s $t_{GB}$, and all data up to and including $t_{GB}$ are discarded. LP filters of
order $K$ are then trained and evaluated on aligned B-scan rows $\mathbf{b}_{t'}$, where $t'$ are 10
experimentally-determined row indices. The calculated values of $\nu_{t'}$ are concatenated
into a feature vector of length equal to the number of rows used. In line with past
investigations [78,79], models of order $K = 4$ were used. Although different values
of $K$ were considered, the overall effect of changing the model order on prediction
error was not significant.

Figure 2.4 illustrates two aligned B-scans corresponding to soils with different
dielectric and surface roughness properties (shown at left), and compares the differences
in corresponding prediction-error power (shown at right). In the top panel, the
features corresponding to a soil with low dielectric constant and high “roughness”
(characterized by a surface with low correlation length) are shown, and the corresponding
prediction-error power tends to be high. As shown in the bottom panel,
a soil with high dielectric constant and low “roughness” (characterized by a highly-correlated
surface) yields more predictable data, and therefore the prediction-error
power is much lower and is more constant with respect to time slice index.

2.2.1  Feature  consolidation 

Several of the proposed contextual features ($e$, $\Gamma$, $n_{MP}$) are extracted from individual
A-scans, and it is important to eliminate redundancy and spatial dependence in
feature extraction. Therefore, these features were averaged across the columns of
the B-scan from which they were extracted. Table 2.1 summarizes the elements
of the 23-dimensional feature vector, $\mathbf{x}^{(F)} = [e, \Gamma, n_{MP}, \mathbf{h}_{MP}, \boldsymbol{\nu}]$, which provides a
low-dimensional representation of the B-scan’s contextual information. In all experiments
performed on these features, the dimensions of $\mathbf{x}^{(F)}$ were normalized to be zero mean,
unit variance prior to further processing.
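
The consolidation and normalization steps can be sketched as follows. The function names `consolidate` and `standardize`, the argument layout, and the random stand-in data are assumptions made for illustration; only the averaging of per-A-scan features, the 23-element layout, and the zero-mean, unit-variance scaling come from the text.

```python
import numpy as np

def consolidate(e, gamma, n_mp, h_mp, nu):
    """Average the per-A-scan features and concatenate everything into one vector."""
    scalars = [np.mean(e), np.mean(gamma), np.mean(n_mp)]
    return np.concatenate((scalars, np.asarray(h_mp, float), np.asarray(nu, float)))

def standardize(X):
    """Scale each column of X (one row per B-scan) to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Stand-in data (assumed): 50 B-scans, each with 100 A-scans, 10 histogram bins, 10 LP rows.
rng = np.random.default_rng(1)
X = np.vstack([
    consolidate(rng.random(100), rng.random(100), rng.integers(1, 20, 100),
                rng.random(10), rng.random(10))
    for _ in range(50)
])
X_norm = standardize(X)
```

In a train/test setting, the column means and standard deviations would typically be computed on the training data only and then reused for the test data; the text does not say which convention was used.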

The following sections illustrate the efficacy of the proposed features in characterizing
multiple subsurface environmental properties. Experiments were performed
using both simulated and field-collected GPR data, and preliminary analysis was
originally presented in [80]. Descriptions of the data sets are also provided, including
the settings of feature extraction parameters for each experiment.

FIGURE 2.4: Example of LP prediction-error power extracted from aligned B-scans
over simulated soils. Top-left: simulated aligned B-scan of soil with low dielectric
constant and low surface correlation length. Top-right: LP prediction-error power,
measured as a function of temporal index, for soil with low dielectric constant and low
surface correlation length. Bottom-left: simulated aligned B-scan of soil with high
dielectric constant and high surface correlation length. Bottom-right: LP prediction-error
power, measured as a function of temporal index, for soil with high dielectric
constant and high surface correlation length. © 2012 IEEE.

Table 2.1: Features ($\mathbf{x}^{(F)}$) for classification and regression of environmental parameters.

Element               Description                         # of Dimensions
$e$                   Average A-scan energy               1
$\Gamma$              Average reflection coefficient      1
$n_{MP}$              Average # MP iterations             1
$\mathbf{h}_{MP}$     MP temporal histogram               10
$\boldsymbol{\nu}$    LP filter prediction-error power    10

2.3 Evaluating GPR Contextual Features: Simulated Data Experiment

The first experiment to test the efficacy of GPR contextual features was performed
on simulated GPR data with known environmental properties. Simulated B-scans
were generated using the publicly-available GprMax software [73,74], which is based
on the finite-difference time-domain (FDTD) modeling technique [81,82]. Simulated
B-scans were constructed by displaying the measured electric field as a function of
time at a series of fixed locations corresponding to the receiving antenna’s position.

GPR data was simulated over many realizations of a soil environment with several
parameters, some random and others deterministic. The soil environment consisted
of a random rough surface, homogeneous soil background, and random subsurface
scatterers, and was characterized by four environmental model parameters. B-scans
collected over the simulated soil were meant to approximate target-free background
data. Contextual features were extracted from the simulated B-scans, and the simulation
parameters were predicted from the features via relevance vector machine
(RVM) regression and classification [83,84]. Details regarding RVM implementation
can be found in Appendix B.

2.3.1  Simulated  Data  Set 

The two-dimensional computational domain was 5 m × 60 cm, with a spatial resolution
of 2.5 mm. The computational domain was surrounded by a perfectly matched layer
(PML) boundary condition, which is necessary to absorb any extraneous reflections
of the electromagnetic fields off the edges of the domain. A-scans were measured
as the received electric field, collected at spatial intervals of 5 cm, yielding a total
of 100 A-scans per simulation. The transmit and receive elements were modeled as
co-located infinite line sources, polarized in the perpendicular direction, and located
10 cm above the mean surface elevation. The transmitter was excited by a Gaussian
current pulse with center frequency $f_c$ = 2 GHz, therefore yielding a differentiated
Gaussian pulse in the electric field. FDTD was run with a time gate of 6.6 ns, i.e.
the round-trip travel time for a propagation distance of 1 m in air, with a temporal
resolution of $S$ = 1120 time samples per A-scan. Therefore, each FDTD simulation
yielded a 1120 × 100 B-scan consisting of received electric field as a function of
time and receiver location.

The computational domain included a soil half-space characterized by a variety
of model parameters specified a priori. The soil surface was stochastically generated
from a Gaussian power spectrum, characterized by the correlation length parameter
($l^{(surf)}$). The homogeneous soil background was characterized by a range of values
of dielectric constant ($\epsilon_r^{(soil)}$) and conductivity ($\sigma^{(soil)}$). Finally, the subsurface
heterogeneities were of random quantity, characterized by a binomial distribution with
mean $N^{(scats)}$. Several examples of the computational domain are shown in Figure 2.5
alongside the corresponding simulated B-scans. Each example illustrates a unique
combination of model parameters. In the following subsections, the generation of
the various elements of the simulated soil is described in detail.

Figure 2.5: Examples of computational domain and FDTD-simulated GPR data.
Top: $\epsilon_r^{(soil)} = 2$, $\sigma^{(soil)} = 10^{-7}$, $l^{(surf)} = 0.25\lambda$, $N^{(scats)} = 100$. Center: $\epsilon_r^{(soil)} = 5$,
$\sigma^{(soil)} = 10^{-5}$, $l^{(surf)} = 1\lambda$, $N^{(scats)} = 300$. Bottom: $\epsilon_r^{(soil)} = 10$, $\sigma^{(soil)} = 10^{-1}$,
$l^{(surf)} = 2\lambda$, $N^{(scats)} = 500$. The top plot of each panel illustrates the computational
domain, where the horizontal axis represents position, the vertical axis represents
depth, and color represents dielectric constant. The bottom plot illustrates the
corresponding simulated B-scan, where the horizontal axis represents position, the
vertical axis represents time, and color represents the received electric field amplitude.
© 2012 IEEE.

Rough soil surface

A common technique for modeling rough surfaces in scattering experiments is to
model the surface as a stochastic process f(n) (where n denotes spatial index) with
a Gaussian spectrum [24-28]. The surface profile is realized by passing white noise
through a filter with spatial frequency response H(k):


$$ H(k) = \frac{l^{(surf)} h^{(surf)2}}{2 \sqrt{\pi}} \exp\left( -\frac{k^2 l^{(surf)2}}{4} \right) \tag{2.18} $$


where $l^{(surf)}$ and $h^{(surf)2}$ are parameters for correlation length and variance, respectively.
In this experiment, rough surfaces were generated using parameter values
suggested in [28]. The value of $h^{(surf)}$ was fixed at $\lambda_c/20$, where $\lambda_c$ is the wavelength
corresponding to the center frequency of the GPR pulse. The value of $l^{(surf)}$ was
variable, with potential values of $\{\lambda_c/4, \lambda_c/2, \lambda_c, 2\lambda_c\}$.
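
One way to realize such a surface numerically is to shape white Gaussian noise in the spatial-frequency domain, as sketched below. This is a hedged illustration rather than the procedure used to build the data set: the filter amplitude is taken as the square root of a Gaussian power spectrum of the form exp(-k^2 l^2 / 4), the height parameter is interpreted as an RMS height, and the output is rescaled empirically to that height. The function name `rough_surface` and its arguments are hypothetical.

```python
import numpy as np

def rough_surface(n_points, dx, corr_len, rms_height, rng):
    """Random surface profile with an approximately Gaussian spectrum (lengths in meters)."""
    k = 2.0 * np.pi * np.fft.rfftfreq(n_points, d=dx)  # spatial angular frequency (rad/m)
    H = np.exp(-(k * corr_len) ** 2 / 8.0)             # sqrt of exp(-k^2 l^2 / 4)
    noise = rng.standard_normal(n_points)              # white Gaussian noise
    f = np.fft.irfft(np.fft.rfft(noise) * H, n=n_points)
    f -= f.mean()                                      # zero mean surface elevation
    return f * (rms_height / f.std())                  # rescale to the requested RMS height

# Parameters taken from the text: 5 m domain at 2.5 mm resolution, 2 GHz center frequency.
lambda_c = 3.0e8 / 2.0e9                               # center wavelength, 0.15 m
f_short = rough_surface(2000, 0.0025, lambda_c / 4.0, lambda_c / 20.0,
                        np.random.default_rng(0))
f_long = rough_surface(2000, 0.0025, 2.0 * lambda_c, lambda_c / 20.0,
                       np.random.default_rng(0))
```

With the same RMS height, the profile generated with the longer correlation length varies more slowly from sample to sample, which corresponds to the "low roughness" case discussed for Figure 2.4.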

Soil  background 

The soil half-space was characterized by a homogeneous background with spatially-invariant
values of dielectric constant $\epsilon_r^{(soil)}$, conductivity $\sigma^{(soil)}$, and permeability
$\mu^{(soil)} = \mu_0$. Dispersion effects were ignored, so all electromagnetic parameters were
also assumed constant with respect to frequency. Both $\epsilon_r^{(soil)}$ and $\sigma^{(soil)}$ were variable,
with potential values to characterize a wide range of soils as tabulated in [10]: $\epsilon_r^{(soil)}$ =
{2, 3, ..., 10} and $\sigma^{(soil)}$ = {$10^{-7}$, $10^{-6}$, ..., $10^{-1}$}.

Random  Scatterers 

Heterogeneity in the soil half-space was modeled by overlaying many random box-shaped
scatterers onto the background medium, in a manner similar to that used
in [22]. The lower left-hand coordinates of the scatterers were uniformly distributed:
$x^{(scat)} \sim U(0, 5\,\text{m})$, $y^{(scat)} \sim U(0, \max f(n) - 2.5\,\text{cm})$. The dimensions of the scatterers
were also uniformly distributed: $x \sim U(d, 20\,\text{cm})$, $y \sim U(d, 20\,\text{cm})$. The dielectric
constant of each scatterer ($\epsilon_r^{(scat)}$) was also random, but drawn from a distribution
that allowed them to appear as perturbations from the background:

$$ \epsilon_r^{(scat)} = 1 + \tilde{\epsilon}_r \tag{2.19} $$

$$ \tilde{\epsilon}_r \sim \log\mathcal{N}(\mu, \sigma), \tag{2.20} $$

where

$$ \mu = \log\left( \frac{m^2}{\sqrt{v + m^2}} \right) \tag{2.21} $$

$$ \sigma = \sqrt{\log(1 + v/m^2)} \tag{2.22} $$

$$ m = \epsilon_r^{(soil)} - 1 \tag{2.23} $$

$$ v = 0.5. \tag{2.24} $$

Drawing values from this distribution ensures $\epsilon_r^{(scat)} > 1$ and $E[\epsilon_r^{(scat)}] = \epsilon_r^{(soil)}$. The
conductivity of all scatterers was fixed at $\sigma^{(soil)}$. Finally, the number of scatterers
present in the soil half-space was drawn from a binomial distribution, $n \sim
\text{bin}(2N^{(scats)}, p = 0.5)$, where $N^{(scats)}$ is a variable parameter indicating the expected
number of subsurface scatterers. Three potential values of $N^{(scats)}$ were used:
{100, 300, 500}.
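
The sampling scheme of Eqs. 2.19 through 2.24 can be checked numerically with a short sketch. The function names `sample_scatterer_permittivity` and `sample_scatterer_count` are hypothetical; the parameterization follows the equations above, with NumPy's log-normal sampler taking the mean and standard deviation of the underlying normal distribution.

```python
import numpy as np

def sample_scatterer_permittivity(eps_soil, size, rng, v=0.5):
    """Draw scatterer dielectric constants as 1 + log-normal perturbation (Eqs. 2.19-2.24)."""
    m = eps_soil - 1.0                                   # Eq. 2.23
    mu = np.log(m ** 2 / np.sqrt(v + m ** 2))            # Eq. 2.21
    sigma = np.sqrt(np.log(1.0 + v / m ** 2))            # Eq. 2.22
    return 1.0 + rng.lognormal(mean=mu, sigma=sigma, size=size)

def sample_scatterer_count(n_scats, rng):
    """Number of scatterers: binomial with 2N trials and p = 0.5, so the mean is N."""
    return int(rng.binomial(2 * n_scats, 0.5))

rng = np.random.default_rng(0)
eps_scat = sample_scatterer_permittivity(eps_soil=5.0, size=200000, rng=rng)
n_scatterers = sample_scatterer_count(300, rng)
```

With this choice of $\mu$ and $\sigma$, the log-normal term has mean $m$ and variance $v$, so the sample mean of `eps_scat` sits near the background value of 5 while every draw remains above 1.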

Size  of  Data  Set 

In total, there were 756 possible combinations of the variable soil parameters (9
different values of $\epsilon_r^{(soil)}$, 7 values of $\sigma^{(soil)}$, 4 values of $l^{(surf)}/\lambda$, 3 values of $N^{(scats)}$).
Two unique simulations were performed for each combination of these parameters,
yielding a total of 1512 B-scans from which features were extracted.
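
The size of the parameter grid can be verified by enumerating it directly. The variable names below are hypothetical; the parameter values are those listed in the preceding subsections.

```python
from itertools import product

eps_values = list(range(2, 11))                       # dielectric constant: 2, 3, ..., 10
sigma_values = [10.0 ** p for p in range(-7, 0)]      # conductivity: 1e-7, ..., 1e-1
surf_values = [0.25, 0.5, 1.0, 2.0]                   # correlation length in wavelengths
scat_values = [100, 300, 500]                         # expected number of scatterers

grid = list(product(eps_values, sigma_values, surf_values, scat_values))
n_combinations = len(grid)                            # 9 * 7 * 4 * 3
n_bscans = 2 * n_combinations                         # two simulations per combination
```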

2.3.2  Feature  Extraction 

All of the features described in Section 2.2 were extracted from the 1512 simulated
B-scans. Table 2.2 lists the parameters that were used to extract features from
this data. The values of $\sigma_w$, $\delta_0$, $t'$, and $K$ were determined experimentally to yield the
best overall performance, the value of C is arbitrary, the value of $\omega$ was determined
by inspection of the transmitted GPR waveform, and the values of $N$ and $T$ are
artifacts of the data simulation.

Table 2.2: Feature extraction parameters for simulated data experiment

Parameter     Description                        Value
$N$           Number of A-scans per B-scan       100
$T$           1 m temporal sampling rate         1120
$\sigma_w$    Window width parameter             30
$\delta_0$    MP convergence threshold           0.01
$\omega$      Width of MP dictionary elements    200
$t'$          B-scan row indices                 [illegible]
$K$           Linear predictor filter order      4

2.3.3  Correlation  Analysis 

Pairwise correlations between the features and the labels [$\epsilon_r^{(soil)}$, $\sigma^{(soil)}$, $l^{(surf)}$, $N^{(scats)}$]
were calculated to illustrate the efficacy of each feature in characterizing one or more
soil properties. Figure 2.6 illustrates the correlation of each feature with each of the
four labels. The values of $\epsilon_r^{(soil)}$ were correlated (or inversely correlated) with most of
the features. This is because $\epsilon_r^{(soil)}$ greatly affects the signal amplitude, both at the
air/ground interface and within the soil itself, and most of the proposed features are
functions of signal amplitude. Furthermore, $\sigma^{(soil)}$ was correlated with the matching
pursuits histogram, suggesting that this feature may be indicative of the attenuation
of signals as a function of time. The values of $l^{(surf)}$ are most correlated with the
early-time measurements of LP power, suggesting that the most unpredictable rows
of the B-scan may be due to rough surface scattering. Finally, $N^{(scats)}$ did not
correlate as highly with the matching pursuits histogram as originally hypothesized.
This could be due to insufficient binning of the matching pursuits histogram, or not
enough variation in the values of $N^{(scats)}$ considered in this experiment.
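
A pairwise correlation analysis of this kind can be sketched as below, assuming the Pearson correlation coefficient is the intended measure (the text does not name it). The function name `feature_label_correlations` and the synthetic data are hypothetical.

```python
import numpy as np

def feature_label_correlations(X, Y):
    """Pearson correlation of every feature column in X with every label column in Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize features
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)   # standardize labels
    return Xz.T @ Yz / X.shape[0]               # shape: (n_features, n_labels)

# Synthetic check (assumed): one label, three features with known relationships to it.
rng = np.random.default_rng(0)
label = rng.standard_normal((1000, 1))
features = np.hstack([2.0 * label + 1.0,                 # perfectly correlated
                      -label,                            # perfectly inversely correlated
                      rng.standard_normal((1000, 1))])   # unrelated
corr = feature_label_correlations(features, label)
```

Each row of `corr` corresponds to one feature dimension and each column to one soil label, which matches the layout plotted in Figure 2.6 (features on the horizontal axis, one line per label).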

2.3.4  Classification  Results 

Figure 2.6: Plot of correlations between features (horizontal axis) and soil labels
(line color) for the simulated GPR experiment. © 2012 IEEE.

An RVM was used to classify features extracted from the simulated GPR data according
to the known soil properties. For each multi-class problem, the RVM was
trained using a one-against-all approach, and test observations were assigned to the
maximum a posteriori (MAP) class. Classification of B-scans according to the soil
labels with the RVM was evaluated via 10-fold cross-validation. Results are shown
in Figure 2.7. Each confusion matrix illustrates overall classification performance,
with truth listed on the vertical axis and classification result on the horizontal. The
percent of B-scans classified correctly is shown at the top of each confusion matrix.
For some of the labels, classification was very good: classifying the simulated GPR
data by $\epsilon_r^{(soil)}$ yielded an overall accuracy of 97.24% (compared to 11.1% chance
accuracy), and classification by $l^{(surf)}/\lambda_c$ yielded an accuracy of 90.41% (compared
to 25% chance accuracy). The result of classification by $N^{(scats)}$ was still relatively
good, achieving a correct classification rate of 76.5% (compared to 33.3% chance
accuracy). Classification by $\sigma^{(soil)}$ yielded good performance in identifying conditions
with very high values of conductivity, while lower conductivities were often confused.

Figure 2.7: Confusion matrices illustrating results of RVM classification of $\epsilon_r^{(soil)}$
(top-left), $\sigma^{(soil)}$ (top-right), $l^{(surf)}/\lambda_c$ (bottom-left), and $N^{(scats)}$ (bottom-right) for
the simulated data experiment. Vertical axes indicate the true labels, and horizontal
axes indicate the classifier response. Panel titles: $\epsilon_r^{(soil)}$ (97.29% correct);
$\sigma^{(soil)}$ (35.38% correct); $l^{(surf)}/\lambda$ (90.41% correct); $N^{(scats)}$ (76.52% correct).


However,  others  have  illustrated  that  the  overall  effect  of  soil  conductivity  on  GPR 
is  minimal  unless  the  conductivity  is  very  high  [20].  The  performance  of  the  RVM 
in  classifying  B-scans  by  soil  conductivity  confirms  these  observations. 
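
For reference, the confusion matrices, overall accuracy, and chance accuracies quoted above can be computed as in the following sketch. It does not reproduce the RVM itself; the function names and toy labels are hypothetical, and chance accuracy is taken to be one over the number of equally likely classes.

```python
import numpy as np

def confusion_matrix(truth, response, n_classes):
    """Counts with true class along rows (vertical axis) and classifier response along columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(truth), np.asarray(response)), 1)
    return cm

def accuracy(cm):
    """Fraction of observations on the diagonal, i.e. classified correctly."""
    return np.trace(cm) / cm.sum()

# Chance accuracy for a balanced problem is 1 / (number of classes).
n_classes = {"eps_r": 9, "sigma": 7, "l_surf": 4, "n_scats": 3}
chance = {name: 1.0 / k for name, k in n_classes.items()}

# Toy example (assumed labels): four observations, three classes.
toy_cm = confusion_matrix([0, 0, 1, 2], [0, 1, 1, 2], n_classes=3)
toy_accuracy = accuracy(toy_cm)
```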

2.3.5  Regression  Results 

Unlike  classification,  which  makes  “hard”  decisions,  regression  allows  for  the  quan¬ 
titative  estimation  of  the  underlying  soil  parameters.  RVM-based  regression  results 
are  shown  in  Figure  2.8.  Each  plot  shows  the  regression  output  for  each  observation 
(dashed  line)  and  the  true  values  of  the  soil  parameters  (solid  line).  The  goodness- 
of-fit  is  summarized  by  the  RMS  error,  which  is  shown  above  each  plot. 
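
The RMS error used to summarize goodness-of-fit can be written as a one-line function; the name `rmse` and the toy values are hypothetical. Note that the RMSE carries the units and scale of the parameter being estimated, so values for different soil parameters are not directly comparable.

```python
import numpy as np

def rmse(truth, estimate):
    """Root-mean-square error between true parameter values and regression estimates."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # single error of 2 over three samples
```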

As in classification, RVM regression was able to very accurately estimate $\epsilon_r^{(soil)}$
(RMSE = 0.11217) and $l^{(surf)}/\lambda_c$ (RMSE = 0.29765). Furthermore, estimation of
$\sigma^{(soil)}$ was accurate only for the highest values, yielding an RMSE of 0.006. The only
major difference between the regression and classification results was the estimation
of $N^{(scats)}$, which yielded an RMSE of 589. This may be due to the fact that $N^{(scats)}$
is the expected number of subsurface scatterers, rather than the exact number.

FIGURE 2.8: Results of RVM regression for predicting $\epsilon_r^{(soil)}$ (top-left), $\sigma^{(soil)}$ (top-right),
$l^{(surf)}$ (bottom-left), and $N^{(scats)}$ (bottom-right) for the simulated data experiment.
The blue line in each plot indicates the true values of each parameter, and
the dashed red line indicates the regression estimate. Panel titles: estimating $\epsilon_r^{(soil)}$,
RMSE = 0.11217; estimating $\sigma^{(soil)}$, RMSE = 0.005957; estimating $l^{(surf)}/\lambda$,
RMSE = 0.29765; estimating $N^{(scats)}$, RMSE = 589.0012. Horizontal axes: sample index.

Overall, however, regression results illustrate that it is possible to predict not only
the different environmental conditions that were imposed on the simulated data, but
also how different those conditions are. In practical applications, it may be
useful for a context-dependent processing strategy to tell when the underlying soil
context changes dramatically (such as after a heavy rainfall) rather than subtly (such
as a light misting of rain).
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2.4  Evaluating  GPR  Context  Features:  Field  Data  Experiment 

A  second  experiment  was  performed  to  evaluate  the  GPR  features  on  field-collected 
data.  The  features  were  used  to  predict  measurements  of  soil  moisture  and  tem¬ 
perature  by  a  meteorological  station  at  an  Eastern  U.S.  government  test  site.  The 
following  subsections  describe  the  data  set  used  in  this  experiment,  and  present  the 
results  of  RVM  regression. 

2.4.1 Field-Collected Data Set

The GPR data used in this experiment was collected at a temperate Eastern U.S.
government test site for a total of 12 days, over 4 campaigns of 2-5 days each
between March and August 2008. The data collection site comprised two dirt
and three gravel test lanes in which anti-tank landmines were emplaced. As data
was collected, the GPR operator maintained an array height of approximately 7-8 in.
above the ground. The GPR made several overlapping passes down each lane, in
opposite directions, to ensure that the entire width of the lane was covered. For this
experiment, only the first and last passes on each lane from each day are considered,
to ensure the maximum possible change in soil conditions between passes.

A meteorological station was installed at the test site to collect various data
regarding air and soil conditions. Figure 2.9 shows a photograph of the meteorological
station located between a dirt lane and a gravel lane. The station recorded air
temperature, humidity, atmospheric pressure, wind speed, wind direction, precipitation,
dirt temperature (at depths of 1/2, 2, 4, and 8 in.), gravel temperature (at depths of
1/2, 2, 4, and 8 in.), soil moisture (at depths of 2, 4, and 8 in.), short-wave radiation
(both up- and down-welling), and long-wave radiation (both up- and down-welling) at
5-minute intervals.

Figure 2.9: Photograph of the meteorological station located at the Eastern US
test site. The station is located between a dirt and gravel lane, with soil probes
embedded in each lane.

Only measurements of dirt temperature, gravel temperature, and soil moisture
were used in this experiment, since the other measurements were determined to be
irrelevant  or  did  not  show  significant  variation.  The  soil  measurements  were  averaged 
over  depth  since  the  only  variation  with  respect  to  depth  appeared  to  be  scaling. 
The  soil  measurements  were  also  averaged  over  each  day  since  accurate  timestamp 
information  was  not  available  to  cross-register  the  GPR  data  with  the  meteorological 
data. 
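
The two averaging steps described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout (a day label paired with one reading per probe depth), the function name `daily_depth_average`, and the sample values are hypothetical and are not taken from the meteorological data.

```python
from collections import defaultdict
from statistics import mean


def daily_depth_average(records):
    """Average each sample over depth, then average the results over each day.

    `records` is an iterable of (day, readings_by_depth) pairs, where
    `readings_by_depth` holds one reading per probe depth for a single
    5-minute sample. Returns a dict mapping day -> daily mean.
    """
    per_day = defaultdict(list)
    for day, readings_by_depth in records:
        per_day[day].append(mean(readings_by_depth))  # average over depth
    # average the depth-averaged samples over each day
    return {day: mean(values) for day, values in per_day.items()}


# Hypothetical soil-moisture samples at three probe depths.
samples = [
    ("day 1", [0.10, 0.12, 0.14]),
    ("day 1", [0.12, 0.14, 0.16]),
    ("day 2", [0.20, 0.22, 0.24]),
]
daily = daily_depth_average(samples)
```

Each GPR pass collected on a given day would then share that day's averaged measurement as its regression target, which is the coarsest alignment possible when timestamps cannot be cross-registered.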

2.4.2 Feature Extraction

After the data was collected, a prescreener was run on the raw GPR data to flag
locations of detected anomalies, and background B-scans of length 100 were extracted
prior to each prescreener alarm. Examples of the background data prior to
prescreener alarms are shown in Figure 2.10. The 23-dimensional contextual features
were then extracted from the background B-scans. Table 2.3 lists the parameters
that were set for performing feature extraction on the field data.

[Figure 2.10 panel titles: Example Field-Collected B-Scan: Dirt Lane; Example Field-Collected B-Scan: Gravel Lane.]

Figure 2.10: Example B-scans of field-collected GPR data collected on dirt (top)
and gravel (bottom) lanes at an Eastern US test site. The images show background
data collected prior to a prescreener alarm. The anomaly that was flagged can be
seen at the far right of each image. © 2012 IEEE.


Table 2.3: Feature extraction parameters for field data experiment

Parameter     Description                        Value
N             Number of A-scans per B-scan       100
T             lm temporal sampling rate          512
              Window width parameter             5
$\delta_0$    MP convergence threshold           0.01
$\omega$      Width of MP dictionary elements    25
t'            B-scan row indices                 0, 30, ..., 300
K             Linear predictor filter order      4

2.4.3 Correlation Analysis


To  assess  the  efficacy  of  each  individual  feature  in  characterizing  the  meteorological 
data,  each  feature  was  correlated  with  the  soil  measurements  and  the  correlations  are 
plotted in Figure 2.11. The results are quite intuitive: soil moisture is most correlated
with the energy, reflection coefficient, and early-time LP power features, since
moisture has a great impact on overall soil permittivity. Because the measurements
of dirt and gravel temperature were very similar, both measurements are correlated
with the late-time MP histogram. If we recall the results of the simulated data
experiment, the late-time MP histogram was most correlated with conductivity. A
relationship has been shown to exist between soil conductivity and temperature [85],
and is probably related to the drying of soils as temperature increases. Therefore, the
correlation  analysis  suggests  that  the  matching  pursuits  histogram  may  be  indicative 
of  soil  temperature  as  well  as  conductivity. 
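
The per-feature correlation described above can be sketched as follows. The text does not state which correlation measure was used, so the Pearson coefficient is assumed here. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def feature_correlations(features, measurement):
    """Correlate every feature column with one soil measurement.

    `features` is a list of observations (rows), each a list of feature values.
    """
    n_features = len(features[0])
    return [pearson([row[d] for row in features], measurement)
            for d in range(n_features)]


# Toy data: feature 0 rises with the measurement, feature 1 falls with it.
features = [[1.0, 9.0], [2.0, 7.0], [3.0, 5.0], [4.0, 3.0]]
moisture = [0.10, 0.12, 0.14, 0.16]
correlations = feature_correlations(features, moisture)
```

Plotting one such list per soil measurement against the feature index gives a figure of the same form as Figure 2.11.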

2.4.4 Regression Results

As  in  the  simulated  data  experiment,  regression  was  performed  on  the  kernel-mapped 
features  using  the  RVM  and  evaluated  via  10-fold  cross-validation.  Results  of  using 
RVM  regression  to  predict  the  soil  measurements  from  the  contextual  features  are 
shown  in  Figure  2.12.  Because  the  measurements  of  dirt  and  gravel  temperature  were 
similar,  the  regression  performance  was  also  similar,  achieving  estimation  accuracy 
within  5-6  degrees  (the  RMSE  for  dirt  temperature  was  5.05,  and  for  gravel  was  6.05). 
More  importantly,  these  results  illustrate  that  the  RVM  is  able  to  distinguish  between 
major  differences  in  temperature.  Soil  moisture  was  estimated  with  relative  accuracy 
for  higher  values  (>  0.14),  but  there  appears  to  be  an  offset  in  regression  estimates 
for  lower  values.  Lower  values  of  moisture  correspond  to  lower  conductivity,  and  as 
was  seen  in  the  simulated  data  experiment,  it  is  difficult  to  estimate  low  values  of 
conductivity  using  these  features. 
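
The evaluation protocol (10-fold cross-validation summarized by a single RMSE) can be sketched as below. A one-dimensional least-squares line is used purely as a stand-in regressor, since an RVM implementation is beyond the scope of a short example. The fold construction (contiguous, unshuffled) and all function names are assumptions.

```python
import math


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(x, y):
    """Least-squares slope and intercept (stand-in for the RVM regressor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(x, y, k=10):
    """Train on k-1 folds, predict the held-out fold, pool the squared errors."""
    squared_error = 0.0
    for test_idx in kfold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        squared_error += sum((y[i] - (slope * x[i] + intercept)) ** 2
                             for i in test_idx)
    return math.sqrt(squared_error / len(x))


# Toy data that is exactly linear, so the held-out error should vanish.
x = [float(i) for i in range(50)]
y = [2.0 * v + 1.0 for v in x]
rmse = cross_validated_rmse(x, y, k=10)
```

Because every observation is predicted exactly once by a model that never saw it, the pooled RMSE is an out-of-sample error and is directly comparable across soil measurements with the same units.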

[Figure 2.11 plot title: Correlation of Features with Context Labels.]

Figure 2.11: Plot of correlations between features (horizontal axis) and measured
soil properties (line color) for the field data experiment. © 2012 IEEE.

[Figure 2.12 panel titles: Estimating Dirt Temperature: RMSE = 5.0534; Estimating Gravel Temperature: RMSE = 6.0482; Estimating Soil Moisture: RMSE = 0.0095015.]

FIGURE 2.12: Results of RVM regression to predict dirt temperature (left), gravel
temperature (center), and soil moisture (right) from contextual features extracted
from field data. The RMS error is shown at the top of each plot.

2.5 Discussion


Physics-based features were developed to provide a low-dimensional representation
of GPR data that may be useful for context learning. To verify the efficacy of the
proposed contextual features, experiments were performed using simulated and field-
collected GPR data. In these experiments, the proposed features were extracted from
a variety of B-scans collected over varying environmental conditions, and an RVM
was then applied to the features for predicting the underlying soil properties. In
the simulated data experiment, it was shown that several underlying soil parameters
were predictable from the features. In the field data experiment, quantitative
estimates of subsurface temperature and moisture were obtained via RVM regression.

Although some soil properties were more accurately predicted than others in both
experiments, in all cases the proposed features were characteristic of major
differences in the soil context. In context-dependent learning for detecting buried
threats in GPR data, it may be important for the algorithm to tell when such major
contextual shifts take place. Given features that are indicative of environmental
context, a statistical model could be used to group contextually similar observations
into distinct clusters. Then, a mixture of context-specific classifiers could potentially
be learned as an alternative to global classification. The next several chapters
discuss context modeling techniques that apply statistical mixture models for
clustering the features proposed in this chapter into distinct contexts.

Basic Context Learning Techniques


Although many algorithms have been developed to automate buried threat detection
in GPR data, past comparisons have shown that certain algorithms perform
best under specific environmental conditions [41]. In the previous chapter, it was
shown that multiple environmental factors can be characterized from GPR data by
using the proposed contextual features. This chapter presents two basic techniques
for clustering these features into distinct contexts, a process referred to as context
learning, using both supervised and unsupervised techniques. Each of the learned
contexts should be representative of a unique set of environmental conditions.
If supervised learning is employed, the contexts should correspond to known
contextual labels (e.g. soil type). Unsupervised learning, however, may cluster the
features in a more informative way. For example, a broad category of observations
with the context label “dirt” could potentially be clustered into many sub-contexts
using unsupervised learning.

After context learning is performed, unique classifiers may be trained on the data
from each context. In this work, an ensemble of relevance vector machines (RVMs)
is used to perform context-dependent algorithm fusion. Context-dependent learning
allows for the fusion weights of several algorithms to be learned according to their
relative performance in different environments. Therefore, it would be expected
that context-dependent fusion would yield better overall target discrimination
performance than a similar global fusion approach, which does not incorporate any
contextual information.

The following sections introduce basic techniques which have been proposed in
past work for supervised [86-88] and unsupervised [89] approaches to context
learning. The RVM is also introduced as a classification model that can be implemented in
a context-dependent learning framework. Experimental results using field-collected
GPR data are presented to highlight the benefits and disadvantages of supervised
and unsupervised context learning, and to motivate the use of nonparametric Bayesian
methods for achieving additional performance improvements.

3.1  Supervised  Context  Learning 


If contextual ground truth is available for the training data, a supervised approach
to context learning may be used. For example, if training data is collected over M
distinct soil types with labels c = 1, 2, ..., M, the individual soil labels could
potentially be useful in learning the model parameters. A simple technique for M-
ary supervised clustering of the contextual features $X^{(C)}$ is a Gaussian hypothesis
test [86,87]. Using Bayes’ theorem, posterior inference can be performed by


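$$
p(c_n = m \mid x_n^{(C)})
  = \frac{p(x_n^{(C)} \mid c_n = m)\, p(c_n = m)}
         {\sum_{j=1}^{M} p(x_n^{(C)} \mid c_n = j)\, p(c_n = j)}
  = \frac{\mathcal{N}(x_n^{(C)} \mid \hat{\mu}_m, \hat{\Sigma}_m)\, p(c_n = m)}
         {\sum_{j=1}^{M} \mathcal{N}(x_n^{(C)} \mid \hat{\mu}_j, \hat{\Sigma}_j)\, p(c_n = j)},
\tag{3.1}
$$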


where $\hat{\mu}_m$ and $\hat{\Sigma}_m$ are the maximum-likelihood estimates of the mean and
covariance of $X^{(C)}$ conditioned on context m. If a uniform prior is assumed, i.e.
$p(c_n = m) = 1/M$ for m = 1, 2, ..., M, (3.1) can be simplified as

$$
p(c_n = m \mid x_n^{(C)})
  = \frac{\mathcal{N}(x_n^{(C)} \mid \hat{\mu}_m, \hat{\Sigma}_m)}
         {\sum_{j=1}^{M} \mathcal{N}(x_n^{(C)} \mid \hat{\mu}_j, \hat{\Sigma}_j)}.
\tag{3.2}
$$


Often for visualization or interpretation purposes, one may wish to make “hard”
classifications of individual observations $x_n^{(C)}$. In that case, individual points may be
assigned to the maximum a posteriori (MAP) class:

$$
c_n = \arg\max_{m}\; p(c_n = m \mid x_n^{(C)}). \tag{3.3}
$$

If the prior on c is uniform, (3.3) simplifies to

$$
c_n = \arg\max_{m}\; \mathcal{N}(x_n^{(C)} \mid \hat{\mu}_m, \hat{\Sigma}_m). \tag{3.4}
$$
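
The Gaussian hypothesis test of (3.1) through (3.4) can be sketched in Python as follows. To keep the example free of linear-algebra dependencies, each context covariance is assumed to be diagonal, which is a simplification of the full covariance in the text. The function names and the two-dimensional toy features are hypothetical.

```python
import math
from collections import defaultdict


def fit_contexts(features, labels):
    """Maximum-likelihood mean and per-dimension variance for each labeled context."""
    grouped = defaultdict(list)
    for x, c in zip(features, labels):
        grouped[c].append(x)
    params = {}
    for c, rows in grouped.items():
        n, dims = len(rows), range(len(rows[0]))
        mu = [sum(r[d] for r in rows) / n for d in dims]
        var = [sum((r[d] - mu[d]) ** 2 for r in rows) / n for d in dims]
        params[c] = (mu, var)
    return params


def log_gaussian(x, mu, var):
    """Log density of a Gaussian with diagonal covariance (assumed simplification)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))


def context_posterior(x, params, prior=None):
    """Posterior over contexts as in (3.1); a uniform prior reduces it to (3.2)."""
    contexts = list(params)
    if prior is None:
        prior = {c: 1.0 / len(contexts) for c in contexts}
    log_joint = {c: log_gaussian(x, *params[c]) + math.log(prior[c])
                 for c in contexts}
    peak = max(log_joint.values())  # subtract the peak to avoid underflow
    unnormalized = {c: math.exp(v - peak) for c, v in log_joint.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


def map_context(x, params, prior=None):
    """Hard assignment to the most probable context, as in (3.3) and (3.4)."""
    posterior = context_posterior(x, params, prior)
    return max(posterior, key=posterior.get)


# Toy contextual features with two labeled soil types.
features = [[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0],
            [3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
labels = ["dirt", "dirt", "dirt", "gravel", "gravel", "gravel"]
params = fit_contexts(features, labels)
```

Calling `map_context([0.1, 0.0], params)` returns the label of the closest fitted Gaussian, while `context_posterior` keeps the soft assignment that is needed later when classifier outputs are weighted by context.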

Although  supervised  context  modeling  allows  for  characterizing  known  context 
labels  from  the  contextual  features,  obtaining  such  labels  for  training  data  can  be 
very  difficult.  There  may  be  little  apparent  variation  between  soils  over  which  data 
was  collected,  eliminating  the  possibility  of  using  qualitative  labels,  or  equipment 
for  measuring  soil  properties  could  be  too  expensive  or  unavailable.  If  equipment 
is  available,  how  to  properly  threshold  soil  measurements  to  yield  distinct  contexts 
is  still  an  open  question.  In  contrast,  unsupervised  learning  allows  for  clustering 
without  the  need  for  discrete  labels,  eliminating  many  of  these  potential  issues.  A 
basic  technique  for  unsupervised  context  learning  is  presented  in  the  following  section. 


3.2  Unsupervised  Context  Learning 


Several techniques for unsupervised clustering exist for grouping together proximate
observations in a multidimensional feature space [70-72]. To draw a parallel to the
supervised context learning technique that was described in Section 3.1, consider a
Gaussian mixture model (GMM) as an unsupervised context model. Observations
drawn from a GMM come from a weighted sum of M Gaussian densities, each with
its own mean $\mu_m$ and covariance $\Sigma_m$, according to the mixture proportions $\pi_m$. The
likelihood function of the GMM is given by

$$
p(x_n^{(C)} \mid \pi, \mu, \Sigma)
  = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(x_n^{(C)} \mid \mu_m, \Sigma_m). \tag{3.5}
$$


The parameters of the GMM may be learned in several ways. Conventionally, the
expectation-maximization (EM) algorithm is used to iteratively maximize the
likelihood of the data given the parameters [90]. Alternatively, variational Bayesian (VB)
inference may be used to learn the full posterior densities of
the model parameters [91]. After estimating the model parameters ($\pi_m$, $\mu_m$, and $\Sigma_m$,
for m = 1, 2, ..., M), posterior probabilities of the resulting contexts may be obtained
by


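$$
p(c_n = m \mid x_n^{(C)})
  = \frac{\pi_m\, \mathcal{N}(x_n^{(C)} \mid \mu_m, \Sigma_m)}
         {\sum_{j=1}^{M} \pi_j\, \mathcal{N}(x_n^{(C)} \mid \mu_j, \Sigma_j)}. \tag{3.6}
$$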
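
As an illustration of how the context posteriors in (3.6) arise during learning, the following is a minimal EM sketch for a one-dimensional GMM. It is not the implementation used in this work: the one-dimensional restriction, the fixed initialization, the variance floor, and all names are assumptions made to keep the example short.

```python
import math


def gaussian(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component given x, as in (3.6)."""
    joint = [w * gaussian(x, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration: E-step responsibilities, then M-step parameter updates."""
    resp = [responsibilities(x, weights, means, variances) for x in data]
    new_weights, new_means, new_variances = [], [], []
    for m in range(len(weights)):
        n_m = sum(r[m] for r in resp)  # effective number of points in component m
        mu = sum(r[m] * x for r, x in zip(resp, data)) / n_m
        var = sum(r[m] * (x - mu) ** 2 for r, x in zip(resp, data)) / n_m
        new_weights.append(n_m / len(data))
        new_means.append(mu)
        new_variances.append(max(var, var_floor))  # guard against collapse
    return new_weights, new_means, new_variances


# Toy data with two well-separated groups.
data = [-0.2, 0.0, 0.1, 0.3, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
for _ in range(50):
    weights, means, variances = em_step(data, weights, means, variances)
```

After convergence, `responsibilities(x, weights, means, variances)` gives the soft context assignment for a new observation, and taking its largest entry gives the MAP context.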


3.3  Within-Context  Target  Classification 


After contexts are learned in the contextual space defined by features $X^{(C)}$, unique
classifiers must be learned for each context using the target features $X^{(T)}$. In this
work, the relevance vector machine (RVM) [83,84] is used as a classification model
due to its sparseness properties and probabilistic output. The RVM is a Bayesian
solution to inference for the logistic discriminant classifier, given by


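$$
y_n = w\, \phi(x_n^{(T)})^{\top}, \tag{3.7}
$$

$$
p(t_n \mid x_n^{(T)}) = \sigma(y_n)^{t_n} \left[1 - \sigma(y_n)\right]^{1 - t_n}, \tag{3.8}
$$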


where w are the D-dimensional classifier weights, $t_n$ is a binary class label (0 or 1) for
observation n, $\sigma(\cdot)$ denotes the logistic sigmoid function, and $\phi(\cdot)$ denotes a kernel
transformation. The RVM incorporates sparseness-promoting priors on w, given by

$$
w_d \sim \mathcal{N}(0, \alpha_d^{-1}), \quad d = 1, 2, \ldots, D, \tag{3.9}
$$

$$
\alpha_d \sim \mathrm{Gamma}(a_0 = 10^{-6},\; b_0 = 10^{-6}). \tag{3.10}
$$

Sparseness is promoted in the RVM weights by assuming the weights are statistically
independent of one another and placing a non-informative Gamma prior on the
precisions $\alpha$ governing the weights. When Bayesian inference is performed to obtain
a posteriori estimates of the weights, the values of $\alpha$ tend to infinity for weights
corresponding to irrelevant inputs. This yields a posterior infinitely peaked at zero,
and the irrelevant dimensions effectively receive a weight of zero. Those dimensions
receiving nonzero weight are referred to as relevance vectors. Details regarding RVM
inference are provided in Appendix B.
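
The logistic discriminant that the RVM places its priors on can be sketched as follows. The example only evaluates the discriminant and its Bernoulli likelihood for a given weight vector; it does not perform RVM inference. The identity-plus-bias basis used for $\phi$, the function names, and the weight values are illustrative assumptions.

```python
import math


def sigmoid(y):
    """Numerically stable logistic sigmoid."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def basis(x):
    """Illustrative basis: a bias term followed by the raw feature vector."""
    return [1.0] + list(x)


def target_probability(w, x):
    """Probability that t = 1: the sigmoid of the weighted basis expansion."""
    y = sum(w_d * phi_d for w_d, phi_d in zip(w, basis(x)))
    return sigmoid(y)


def bernoulli_likelihood(w, x, t):
    """Likelihood of a binary label t (0 or 1) under the logistic discriminant."""
    p = target_probability(w, x)
    return p ** t * (1.0 - p) ** (1 - t)


# Hypothetical weights: the zero entry mimics a dimension pruned by the sparse prior.
w = [-1.0, 2.0, 0.0, 0.5]
x = [0.8, 3.0, 1.2]
p_target = target_probability(w, x)
```

A weight driven to zero removes its basis dimension from the discriminant entirely, so changing the second entry of `x` above leaves `p_target` unchanged. This is the mechanism behind the sparseness described in the text.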

In  context-dependent  learning,  classification  is  set  up  as  a  mixture  of  RVMs. 
Extending  (3.7)  and  (3.8)  to  a  mixture  of  classifiers  yields 


$$
y_{nm} = w_m\, \phi(x_n^{(T)})^{\top}, \quad m = 1, 2, \ldots, M, \tag{3.11}
$$

$$
p(t_n \mid c_{nm} = 1, x_n^{(T)})
  = \sigma(y_{nm})^{t_n} \left[1 - \sigma(y_{nm})\right]^{1 - t_n}, \tag{3.12}
$$

where $c_{nm}$ is a binary-coded latent variable that is equal to 1 if the true context of
$x_n$ is context m. The latent variables are inferred from the results of context
identification. If a supervised context model is known, c is not random and the mixture
of  RVMs  essentially  can  be  learned  as  M  individual  RVMs  trained  for  each  of  the 
known  contexts.  Otherwise,  c  must  be  treated  probabilistically  and  be  incorporated 
into  the  learning  of  each  wm.  Details  regarding  learning  mixtures  of  RVMs  are  also 
included  in  Appendix  B. 

An advantage of the RVM over other sparse kernel machines, such as the support
vector machine (SVM) [92], is that it yields probabilistic outputs, i.e. posterior
probabilities of $t_n$. The probabilistic RVM outputs can be easily used in context-
dependent learning by integrating them over uncertainty in the underlying context.
The SVM, by contrast, yields distances from the decision boundary that are not
easily interpretable in a Bayesian framework. Another advantage of the RVM is that
$\phi$ need not be a kernel function that satisfies Mercer’s conditions [70,83]. Therefore,
the direct kernel, i.e. $\phi(x_n^{(T)}) = [1, x_n^{(T)}]$, may be used in training the RVM. The
effect of using a direct kernel is that irrelevant features will receive zero weight, so
training a direct-kernel RVM is a de facto method for feature selection. In
context-dependent learning, using direct-kernel RVMs provides an intuitive way of
performing context-dependent feature selection: features that are relevant for
classification in a particular context will receive nonzero weight from the classifier trained
for that context [86-89].

After training the (supervised or unsupervised) context model on the contextual
features $X^{(C)}$ and the mixture of RVMs on the target features $X^{(T)}$, the outputs
of both must be combined to yield a posterior probability of an observation being
a target. This is accomplished by integrating the within-context target posteriors
obtained from the RVM, $p(t_n \mid c_{nm} = 1, x_n^{(T)})$, over the context posteriors obtained
from the context model, $p(c_{nm} = 1 \mid x_n^{(C)})$:

$$
p(t_n \mid x_n^{(C)}, x_n^{(T)})
  = \sum_{m=1}^{M} p(t_n \mid c_{nm} = 1, x_n^{(T)})\; p(c_{nm} = 1 \mid x_n^{(C)}). \tag{3.13}
$$

The resulting posterior probability, $p(t_n \mid x_n^{(C)}, x_n^{(T)})$, may then be thresholded for the
purpose of making hard decisions. Overall performance may be measured by evaluating
the probability of detection (PD) and false alarm rate (FAR) as a function of the
decision threshold, and plotting the receiver operating characteristic (ROC) curve.
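
The fusion rule in (3.13) and the subsequent thresholding can be sketched as below. The function names, the convention that an alarm is declared when its confidence is at or above the threshold, and the expression of FAR as false alarms per unit area are assumptions made for illustration.

```python
def fused_target_posterior(within_context_posteriors, context_posteriors):
    """Weight each within-context target posterior by its context posterior, as in (3.13)."""
    if abs(sum(context_posteriors) - 1.0) > 1e-9:
        raise ValueError("context posteriors must sum to 1")
    return sum(p_t * p_c
               for p_t, p_c in zip(within_context_posteriors, context_posteriors))


def roc_points(confidences, is_target, area_m2, thresholds):
    """Return one (FAR, PD) pair per threshold.

    PD is the fraction of targets with confidence >= threshold; FAR is the
    number of non-targets with confidence >= threshold divided by the area.
    """
    n_targets = sum(is_target)
    points = []
    for threshold in thresholds:
        detections = sum(1 for c, t in zip(confidences, is_target)
                         if t and c >= threshold)
        false_alarms = sum(1 for c, t in zip(confidences, is_target)
                           if not t and c >= threshold)
        points.append((false_alarms / area_m2, detections / n_targets))
    return points


# Two contexts: the first classifier is confident, and the first context is more likely.
fused = fused_target_posterior([0.9, 0.2], [0.75, 0.25])

# Hypothetical alarm confidences with ground truth, scored over a 10 m^2 area.
points = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False],
                    area_m2=10.0, thresholds=[0.5])
```

When the context model is supervised and the context is known, the context posterior vector is one-hot and the sum selects a single RVM output. With an unsupervised model, every classifier contributes in proportion to its context probability.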

Table 3.1: Alarm Distribution by Soil Type and Ground Truth

Soil        Clutter (%)        Targets (%)       Total (%)
Dirt        9,356 (72.7%)      933 (54.3%)       10,289 (70.5%)
Gravel      2,658 (20.6%)      393 (22.9%)       3,051 (20.9%)
Asphalt     245 (1.9%)         212 (12.4%)       457 (3.1%)
Concrete    620 (4.8%)         178 (10.4%)       798 (5.5%)
ALL         12,879 (100%)      1,716 (100%)      14,595 (100%)


3.4  GPR  Data  for  Evaluating  Landmine/IED  Detection  Performance 

Both techniques for basic context-dependent learning were evaluated on a large set
of GPR data collected between 2009 and 2010 at two different government test sites in
the continental U.S. One site was located in an arid region of the Southwestern U.S.,
and the other site was located in a temperate region of the Eastern U.S. Data was
collected with the NIITEK GPR over prepared dirt, gravel, asphalt, and concrete
lanes with emplaced targets and clutter objects. The targets included 10 different
types of AT landmines with varied metal content, 155 mm artillery shells, and several
IED targets consisting of a pressure plate, main charge, and command wire. Several
metal and nonmetal clutter objects, including empty holes, were also considered as
potential false alarm sources. The GPR made several passes down each test lane to
ensure the entire area was covered, yielding a total of 171 target encounters and 524
clutter encounters over a total collection area of 92,340 m^2.

A derivative of the LMS prescreener [38] was run offline on the GPR data, and
detected a total of 14,595 anomalies. These locations are referred to as alarms,
and are passed to feature-based algorithms for classification as targets or clutter.
Table 3.1 lists the distribution of prescreener alarms across the four types of
lane construction (referred to henceforth as “soils”).

[Figure 3.1 panel titles: Known Soil Labels; Supervised Context Learning; confusion matrix, 76.3549% Correct, with Response classes Dirt, Gravel, Asphalt, and Concrete.]

FIGURE 3.1: Left: Scatter plot of 3-D PCA projection of contextual features, with
points colored by qualitative soil label. Center: Same scatter plot, but with points
colored by MAP supervised context. Right: Confusion matrix illustrating overall
performance of supervised context learning, evaluated by 10-fold cross-validation.

Contextual features were extracted from a 512 x 100 B-scan from the same channel
of each alarm consisting of the previous background data. Furthermore, the edge
histogram  descriptor  (EHD)  [44],  the  spectral  correlation  features  (SCF)  [49],  and  the 
hidden  Markov  model  (HMM)  [42]  algorithms  were  run  on  the  anomalous  responses 
to  yield  confidence  values  for  each  alarm.  In  performing  algorithm  fusion,  the  target 
feature  vector  consisted  of  the  prescreener,  EHD,  SCF,  and  HMM  confidence 
values.  Unless  otherwise  noted,  all  evaluations  on  GPR  data  that  are  presented  in 
this  dissertation  were  performed  on  this  data  set. 

3.5  Experimental  Results 

3.5.1  Supervised  Context  Learning 

Figure 3.1 illustrates supervised modeling being performed on the 3-D principal
components analysis (PCA) projection of the 23-dimensional contextual GPR features
discussed in the previous chapter. Because this data was collected over four soil types,
one may want to leverage that prior contextual information and train a supervised
context model to infer the soil type from the background features.

In Figure 3.1, the leftmost plot illustrates the scatter of the projected features
colored according to the known soil labels. The points tend to cluster according to
soil type, which was expected since the previous chapter illustrated the efficacy of the
features in characterizing multiple soil properties. The center plot illustrates the supervised
classification result of each point, with each point colored according to the MAP
class determined by the Gaussian hypothesis test. The results are summarized in
the confusion matrix at right. Overall, the contexts of 76.4% of observations were
identified correctly. However, the misclassifications show some interesting results. Data
collected over dirt and gravel are often confused with one another, as are asphalt
and concrete. This result suggests a degree of commonality between these pairs of
contexts. Additionally, gravel is confused with dirt much more often than dirt is
mistaken for gravel. This result, coupled with the fact that there are over three times
as many dirt observations as gravel observations, suggests that the dirt context could
potentially be sub-divided into several smaller sub-contexts with distinct properties.
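
A confusion matrix and the overall percent correct of the kind summarized in Figure 3.1 can be tallied as sketched below. The function names and the handful of toy labels are hypothetical and do not reproduce the experimental counts.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows index the true class, columns index the predicted (response) class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, response in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[response]] += 1
    return matrix


def percent_correct(matrix):
    """Share of observations on the diagonal, as a percentage."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


classes = ["dirt", "gravel", "asphalt", "concrete"]
truth = ["dirt", "dirt", "dirt", "gravel", "gravel", "asphalt"]
response = ["dirt", "dirt", "gravel", "dirt", "gravel", "asphalt"]
matrix = confusion_matrix(truth, response, classes)
accuracy = percent_correct(matrix)
```

Comparing the off-diagonal entries `matrix[i][j]` and `matrix[j][i]` exposes asymmetric confusions such as the gravel-to-dirt pattern noted above.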

The advantages and disadvantages of supervised context modeling are clearly
illustrated by these results. Depending on the labels being used, supervised learning
can be an easy way to verify that the contextual features are indicative of underlying
environmental factors. However, this is only true if the labels are relevant. If
the labels are irrelevant or redundant, supervised context learning may be forced
to differentiate between labeled contexts that have similar or no impact on sensor
performance. Conversely, if the labels are too broad, the resulting clusters may not
be indicative of underlying contextual factors.

3.5.2  Unsupervised  Context  Learning 

Figure 3.2 illustrates examples of unsupervised context learning performed on the
same PCA-projected features as in Figure 3.1. The top two plots illustrate the result
of training a 3-component GMM. As shown by the scatterplot at top-left, the GMM
converged to one large Gaussian cluster and two smaller ones. Two contexts are
primarily composed of dirt and gravel points, and the third context is spread across
all four soil types. In contrast, consider the bottom two plots of Figure 3.2, which
show the result of training an 8-component GMM on the PCA-projected features. In this case,
the asphalt and concrete data are assigned to different contexts; concrete data are
mostly assigned to Context 2, and asphalt data are mostly assigned to Context 7.
The remaining contexts are split between dirt and gravel data.

The  differences  between  the  3-component  GMM  and  the  8-component  GMM 
illustrate  that  the  performance  of  unsupervised  context  learning  can  be  substantially 
affected  by  the  order  of  the  model.  Although  the  results  obtained  from  the  various 
clusterings  can  be  interpreted  in  a  variety  of  ways,  they  may  not  necessarily  be 
indicative  of  the  underlying  phenomenology.  For  example,  asphalt  and  concrete  were 
grouped  together  by  the  3-component  GMM  and  discriminated  by  the  8-component 
GMM.  Although  an  argument  could  be  made  that  both  are  paved  roads,  one  could 
also  argue  that  each  may  constitute  a  unique  propagation  environment.  Therefore,  it 
is important for the model order to be selected carefully: if M is too small, the model
will be too simple and could be under-trained, and if M is too large, the model will
be  too  complex  and  may  run  the  risk  of  over-training.  This  dilemma  is  addressed  by 
the  nonparametric  Bayesian  learning  techniques  that  are  proposed  in  the  following 
chapters. 

3.5.3  Context-Dependent  Fusion  Results 

The following examples illustrate the results obtained from training RVMs for
algorithm fusion using the contextual information obtained through supervised and
unsupervised context modeling. In these examples, RVMs were trained on the different
confidence values obtained for each alarm that was flagged in the data set. The
prescreener, EHD, SPSCF, and HMM algorithms utilize complementary information,
and it has been shown that algorithm fusion aids in performance [41]. Figure 3.3
illustrates the discriminant weights obtained by the RVMs for algorithm fusion in
the labeled dirt, gravel, asphalt, and concrete soil contexts. Each context, illustrated
by the different colors of lines, requires a unique weighting of the four algorithms’
confidences. This result suggests that the contextual labels are relevant; otherwise
the weighting would be the same across all four contexts. Because RVMs are being
used to learn the discriminant weights, a unique subset of the algorithms is selected
as relevant for each context while irrelevant algorithms are completely ignored. It
also appears that, other than the prescreener, no one algorithm is universally relevant,
since each of the three feature-based algorithms receives zero weight in at least one
context.

[Figure 3.2 panel titles: Unsupervised Context Learning: 3-Comp. GMM; Similarity Matrix: Soil Labels and 3-Comp. GMM.]

FIGURE 3.2: Top-Left: Scatter plot of 3-D PCA projection of contextual features,
with points colored by MAP context determined by a 3-component GMM. Top-Right:
Similarity matrix comparing the makeup of the 3 unsupervised contexts to the known
soil labels. Bottom-Left: Scatter plot of 3-D PCA projection of contextual features,
with points colored by MAP context determined by an 8-component GMM. Bottom-
Right: Similarity matrix comparing the makeup of the 8 unsupervised contexts to
the known soil labels.

[Figure 3.3 plot title: RVM Discriminant Weights for Supervised Contexts.]

FIGURE 3.3: RVM discriminant weights learned for algorithm fusion in each
supervised context. Each stem represents a particular dimension of the target feature
space, the vertical axis represents the weight value, and soil contexts are indicated
by line color.

Figure 3.4 shows similar results, but with unsupervised context modeling. The
top plot illustrates the RVM weights obtained for each of the 3 unsupervised contexts,
and the bottom plot illustrates the weights obtained for 8 contexts. Interpretation of
these results can be very difficult and requires experimenting with different orders of
context models and evaluating performance for each case. As shown in the 3-context
case, SPSCF is the only algorithm to ever receive zero weight, and does so in 2 of the
3 contexts, while all other algorithms receive nonzero weight in all contexts. In the
8-context case, the prescreener, EHD, and SPSCF are all irrelevant in at least one
context each, and the HMM appears to be relevant in all 8 contexts. For both of these
cases, the weights are difficult to interpret and performance may be better evaluated
from the ROC curves.

[Figure 3.4 plot title: RVM Discriminant Weights for 3 Unsupervised Contexts.]

FIGURE 3.4: RVM discriminant weights learned for algorithm fusion in both 3 (top)
and 8 (bottom) unsupervised contexts. Each stem represents a particular dimension
of the target feature space, the vertical axis represents the weight value, and soil
contexts are indicated by line color.

3.5.4  Detection  Performance 

Classification performance was evaluated using 10-fold cross-validation over emplaced objects, rather than alarms, to ensure that training and testing did not occur on different observations of the same object. Multiple alarms on the same object were consolidated in scoring by taking the maximum of all alarm confidences over a single pass registered within a radius of 0.25 m from an object’s center. Scoring was performed using the Mine Detection Algorithm Scoring (MIDAS) tool provided by the Institute for Defense Analyses [93].
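The consolidation and fold-assignment rules described above can be sketched in a few lines. This is only an illustration, not the MIDAS tool: the function names, the `(x, y, confidence)` alarm layout, and the round-robin fold assignment are all assumptions made for the example.

```python
import math

def consolidate_alarms(alarms, objects, radius=0.25):
    """Keep, for each object, the maximum confidence among nearby alarms.

    alarms  : list of (x, y, confidence) tuples from one pass (hypothetical layout)
    objects : dict mapping object id -> (x, y) centre of the emplaced object
    """
    best = {}
    for ax, ay, conf in alarms:
        for obj_id, (ox, oy) in objects.items():
            # An alarm counts toward an object if it lies within `radius` metres.
            if math.hypot(ax - ox, ay - oy) <= radius:
                best[obj_id] = max(conf, best.get(obj_id, float("-inf")))
    return best

def assign_object_folds(object_ids, n_folds=10):
    """Assign whole objects (not individual alarms) to cross-validation folds."""
    return {obj_id: k % n_folds for k, obj_id in enumerate(sorted(object_ids))}

alarms = [(0.00, 0.10, 0.4), (0.10, 0.00, 0.9), (5.00, 5.00, 0.7)]
objects = {"object_a": (0.0, 0.0), "object_b": (10.0, 10.0)}
scores = consolidate_alarms(alarms, objects)      # only object_a has alarms in range
folds = assign_object_folds(objects, n_folds=2)   # every alarm inherits its object's fold
```

Because folds are assigned per object, all observations of one object fall on the same side of the train/test split.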

Figure  3.5  illustrates  the  ROC  curves  for  basic  context-dependent  fusion,  using 
both  supervised  and  unsupervised  context  models,  and  compares  performance  to  a 
globally-implemented  RVM  that  incorporates  no  contextual  information.  The  FAR 
for  benchmark  PDs  of  0.85,  0.90,  and  0.95  for  each  algorithm  are  shown  in  the  legend. 
The context-dependent techniques, plotted as solid lines, illustrate varying degrees of improvement over the RVM, whose 90% confidence region is shaded. Somewhat surprisingly, supervised context learning yielded little improvement in performance. The ROC for context-dependent fusion with the supervised
context  model  shows  lower  FAR  than  the  RVM  at  low  PD  levels  (<  0.65),  but  at 
high  PD  (>  0.85)  the  performance  is  essentially  the  same.  This  result  suggests  that 
perhaps  the  soil  labels  that  were  used  are  not  reflective  of  the  true  contextual  factors 
in  this  problem. 

Meanwhile,  unsupervised  context  learning  appears  to  yield  more  useful  contextual 
information.  However,  the  degree  of  improvement  is  dependent  on  the  order  of  the 
context  model.  If  the  model  order  is  chosen  correctly,  significant  improvements 
over  the  single  RVM  are  possible  at  high  PD.  These  results  suggest  that  although 
unsupervised  context  modeling  has  the  potential  to  leverage  contextual  information 
that  is  beyond  qualitative  context  labels,  performance  is  highly  dependent  on  the 
context  model  order  which  must  be  determined  experimentally. 

3.6  Discussion 

In this chapter, basic techniques for context modeling and context-dependent fusion were introduced. Supervised and unsupervised techniques were proposed for modeling context distributions in the features $X^{(C)}$. Evaluation of the supervised context model yielded intuitive results, and the interpretation of the unsupervised context model’s behavior was dependent on the model order. Relevance vector machines were also introduced for the purpose of training context-specific classifiers on the target features $X^{(T)}$. The choice of using a supervised or unsupervised context model appeared to have substantial impact on RVM training and overall model behavior. Finally, performance of context-dependent algorithm fusion was evaluated on a large, geographically-diverse GPR data set consisting of landmine and IED signatures and many false alarms. The potential for context-dependent fusion to improve upon the performance of non-context-dependent RVM fusion was illustrated, although the degree of performance improvement depends on the specific model being used. While unsupervised context modeling led to the greatest performance improvements, the degree of improvement was highly dependent on the model order, i.e., the number of contexts being considered.

FIGURE 3.5: ROC curves for basic context-dependent fusion techniques, compared to non-context-dependent RVM fusion (black dashed) and the individual fused algorithms (dotted). The ROC consists of PD versus FAR, measured in false alarms per square meter, as a function of decision threshold.

These results clearly illustrate that although unsupervised context learning may be advantageous, conventional techniques for clustering that depend on prior knowledge of the model order, $M$, may be prone to over- or under-training. If $M$ is too small, too few unique contexts will be learned, which results in an under-trained model. If $M$ is too large, too many contexts will be learned, resulting in an over-trained model. Furthermore, parametric models such as GMMs are difficult to implement in high-dimensional spaces due to the oft-cited curse of dimensionality [70-72], hence the use of PCA in projecting the 23-D context features to 3-D. Rather than set the order of the context model experimentally by evaluating performance with different numbers of contexts, it may be preferable for an algorithm to learn the optimal number of contexts automatically. Likewise, it may also be preferable to use all of the available contextual features rather than potentially sacrifice information through dimensionality reduction. These items are addressed in the remainder of this dissertation, which proposes Bayesian inference for nonparametric context models that facilitate learning of the optimal model order and the discovery of latent features in high-dimensional spaces.

4  Generative Nonparametric Context Learning


The  previous  chapter  illustrated  that  unsupervised  context  learning  has  potential 
benefits  over  supervised  learning.  Contexts  learned  from  an  unsupervised  model  may 
be  more  informative  than  subjective  context  labels,  which  may  yield  improvements 
to  overall  detection  performance.  However,  unsupervised  context  learning  is  also 
a  problem  of  model  order  selection,  which  translates  to  specifying  the  number  of 
contexts  to  learn.  Learning  too  many  or  too  few  contexts  may  run  the  risk  of  over- 
or  under-training. 

An  alternative  to  specifying  the  model  order  is  to  use  a  nonparametric  mixture 
model  that  facilitates  learning  an  effective  number  of  mixture  components.  In  this 
chapter, two nonparametric context models are proposed. The first model is based
on the Dirichlet Process Gaussian Mixture Model (DPGMM), originally published
by  Blei  and  Jordan  [67].  The  DPGMM  consists  of  an  infinite-order  GMM  with  a 
sparseness-promoting  Dirichlet  process  (DP)  prior,  and  is  useful  for  clustering  when 
the  number  of  clusters  is  unknown  but  can  be  learned  from  the  data. 

It is possible that some contexts may be characterized by different contextual factors, and some contexts may require more or less information to distinguish them from others. The second context model proposed in this chapter, the DP Mixture of Factor Analyzers (DPMFA), is motivated by this hypothesis. Like the DPGMM, the DPMFA can be used to learn the number of clusters present in a data set as well as the latent features describing each cluster. The DPMFA model used in this work was originally proposed by Wang et al. [94].

Both nonparametric context models proposed in this chapter were trained using generative learning. In other words, the context models were trained on the contextual features only, without regard to the target features and target/clutter labels for each observation. Both models were learned using variational Bayesian (VB) inference.

The  following  sections  introduce  the  concept  of  VB  inference,  nonparametric 
models  and  the  DP.  Both  nonparametric  context  models  are  then  introduced  through 
synthetic  data  examples.  Finally,  experimental  results  are  presented  to  compare  the 
merits  of  using  these  models  in  context-dependent  algorithm  fusion  for  buried  threat 
detection  with  GPR. 

4.1  Bayesian  Inference  and  Variational  Learning 

4.1.1  Point Estimation of Model Parameters

Robust parameter estimation is particularly important in unsupervised learning, since labels cannot be used to verify the accuracy of the model. As was alluded to with the GMM presented in Section 3.2, conventional parameter estimation techniques yield point estimates of model parameters. The most common method for parameter estimation is maximum-likelihood (ML). The ML estimates, $\Theta_{ML}$, of the model parameters $\Theta$ are found by maximizing the likelihood of the training data $X$, given by:

$$\Theta_{ML} = \arg\max_{\Theta} \; p(X \mid \Theta) \tag{4.1}$$

In some cases, ML estimates can be found analytically (e.g., estimating the mean and variance of a Gaussian distribution) and in other cases estimates must be found iteratively (e.g., the EM algorithm applied to GMMs) [90]. A common criticism of ML parameter estimation is that it is prone to over-fitting [71,95]. Therefore, an alternative to ML is maximum a posteriori (MAP) parameter estimation. MAP estimation is less prone to over-fitting because prior information is used to regularize inference. This can be seen in Bayes’ theorem,

$$p(\Theta \mid X) = \frac{p(X \mid \Theta)\,p(\Theta)}{p(X)} = \frac{p(X \mid \Theta)\,p(\Theta)}{\int_{\Theta} p(X \mid \Theta)\,p(\Theta)\,d\Theta}, \tag{4.2}$$


where $p(X \mid \Theta)$ is referred to as the likelihood, $p(\Theta)$ is the prior, and $p(X)$ is the evidence. The MAP estimate is obtained by maximizing the posterior density, $p(\Theta \mid X)$:

$$\Theta_{MAP} = \arg\max_{\Theta} \; p(\Theta \mid X) \tag{4.3}$$

However, $\Theta_{MAP}$ is still a point estimate, effectively approximating the posterior uncertainty as a Dirac delta function, when a full posterior density may be desired for some applications. Calculating the posterior density involves solving to obtain the functional form of $p(\Theta \mid X)$. This procedure is known as Bayesian inference.


4.1.2  Bayesian Inference


According to (4.2), the only information required to solve for the posterior density is the likelihood and the prior density. The likelihood is obtained from the statistical model chosen for the problem, the prior density expresses uncertainty in the parameters’ values, and the evidence is effectively a normalizing constant. In many cases, the evidence integral may be difficult to compute. A common technique for circumventing this issue is to assume a conjugate prior distribution [95,96].

A conjugate prior is defined as a distribution that, when paired with a particular model, yields a posterior distribution with the same functional form as the prior. The benefit of using conjugate priors is that the parameters of the posterior density can be calculated as a function of the prior density’s parameters (known as hyperparameters) and the data. Since the posterior is known to be of the same form as the prior, the full posterior density can be determined exactly by simply updating its parameters.

For a simple example, consider a Bernoulli process as the observation model. Under this model, $x \in \{0, 1\}$ with $p(x = 1) = \theta$ and $p(x = 0) = 1 - \theta$. The likelihood function of $n$ successes ($x = 1$) and $m$ failures ($x = 0$) is therefore

$$p(n, m \mid \theta) = \frac{(n+m)!}{n!\,m!}\,\theta^{n}(1-\theta)^{m} \tag{4.4}$$

For a Bernoulli model, the corresponding conjugate prior is the Beta density with hyperparameters $\alpha$ and $\beta$ given by

$$p(\theta) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, & 0 \le \theta \le 1 \\ 0, & \text{otherwise.} \end{cases} \tag{4.5}$$

Solving for the posterior density yields

$$p(\theta \mid n, m) = \frac{p(n, m \mid \theta)\,p(\theta)}{\int_{\theta} p(n, m \mid \theta)\,p(\theta)\,d\theta} = \frac{\theta^{n+\alpha-1}(1-\theta)^{m+\beta-1}}{\int_0^1 \theta^{n+\alpha-1}(1-\theta)^{m+\beta-1}\,d\theta}, \tag{4.6}$$

where the constant factors of the likelihood and prior cancel between numerator and denominator. Note that the integrand in the denominator of (4.6) is the unnormalized $\mathrm{Beta}(n+\alpha, m+\beta)$ density, so the integral equals the reciprocal of that density’s normalizing constant. Simplifying (4.6) therefore yields

$$p(\theta \mid n, m) = \frac{\Gamma(n+m+\alpha+\beta)}{\Gamma(n+\alpha)\,\Gamma(m+\beta)}\,\theta^{n+\alpha-1}(1-\theta)^{m+\beta-1} = \mathrm{Beta}(n+\alpha, m+\beta). \tag{4.7}$$


In this example, the posterior is simply another Beta distribution, as was the prior. The posterior parameters are also a function of the hyperparameters and the observed counts. Therefore, rather than solving Bayes’ theorem explicitly, the posterior can be expressed in terms of the updated parameters. If Bayes’ theorem is being applied sequentially, updating the posterior as more data is observed, the posterior becomes the prior for the next iteration. By exploiting conjugate priors, Bayesian inference can provide a computationally efficient technique for obtaining a full expression of posterior uncertainty.
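As a minimal sketch of the sequential update just described, the snippet below applies the Beta(n + α, m + β) rule one Bernoulli observation at a time and compares the result with a single batch update. The function names and the toy data are illustrative assumptions.

```python
def beta_update(alpha, beta, n, m):
    """Conjugate update: Beta(alpha, beta) prior, n successes, m failures."""
    return alpha + n, beta + m

def beta_mean(alpha, beta):
    """Posterior mean of theta under Beta(alpha, beta)."""
    return alpha / (alpha + beta)

# Sequential use: the posterior after each observation becomes the next prior.
a, b = 1.0, 1.0                      # Beta(1, 1), a flat prior
for x in [1, 0, 1, 1]:               # toy Bernoulli observations
    a, b = beta_update(a, b, n=x, m=1 - x)

# Batch use: all counts at once (3 successes, 1 failure).
batch = beta_update(1.0, 1.0, n=3, m=1)
```

Both routes arrive at the same Beta(4, 2) posterior, which is the sense in which no explicit evaluation of Bayes’ theorem is needed.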

Although conjugate priors are chosen to facilitate mathematical tractability, they should still reflect a priori information about the data. Fortunately, many conjugate priors offer a wide range of uncertainty expression through various parameter settings, including settings that represent little or no prior information, such as Beta(1,1) or a Gaussian with large variance. It is important in any Bayesian inference problem to use a prior that makes sense for the problem, and to choose parameters that allow for controlled regularization, since in many cases certain parameter settings may yield unexpected over- or under-regularization.

Conjugate priors are useful for determining the full posterior uncertainty in parameters of canonical distributions, but complex models often do not lend themselves to a fully-conjugate solution. Often, these types of models involve latent variables; examples include the GMM and HMM, from which draws are conditioned on a finite mixture of component densities. Although point estimates of model parameters could still be obtained via iterative techniques such as the EM algorithm [70-72,90], the same concerns regarding over-fitting discussed previously still apply. Alternatively, one could seek to approximate the posterior by making certain assumptions about the a priori dependence of the model parameters. One technique for estimating the posterior, which has roots in statistical physics, is variational inference.

4.1.3  Variational Bayesian Inference

Variational Bayesian (VB) inference is a technique for approximate posterior inference in many problems where the evidence integral is intractable [67,68,71,84,91,97-101]. Since the evidence integral cannot be computed directly, variational inference is used to maximize a lower bound on it. Using this approximation for the evidence, a subsequent approximation to the posterior can be calculated and is referred to as the variational posterior:

$$q(\Theta) \approx p(\Theta \mid X) \tag{4.8}$$

To determine the lower bound on the evidence integral, first rewrite the evidence as

$$p(X) = \frac{p(X, \Theta)}{p(\Theta \mid X)} = \frac{p(X, \Theta)\,q(\Theta)}{p(\Theta \mid X)\,q(\Theta)}. \tag{4.9}$$


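Taking the logarithm of both sides of (4.9) and integrating against $q(\Theta)$, which leaves the left-hand side unchanged because $q(\Theta)$ integrates to one, yields

$$\begin{aligned}
\log\left[p(X)\right] &= \int \log\left[p(X)\right] q(\Theta)\,d\Theta \\
&= \int \log\left[\frac{p(X, \Theta)\,q(\Theta)}{p(\Theta \mid X)\,q(\Theta)}\right] q(\Theta)\,d\Theta \\
&= \int \log\left[\frac{p(X, \Theta)}{q(\Theta)}\right] q(\Theta)\,d\Theta + \int \log\left[\frac{q(\Theta)}{p(\Theta \mid X)}\right] q(\Theta)\,d\Theta \\
&= \mathcal{F}\left[q(\Theta)\right] + \mathrm{KLD}\left[q(\Theta)\,\|\,p(\Theta \mid X)\right]
\end{aligned} \tag{4.10}$$

The first term of (4.10) is the negative of what is known in statistical physics as free energy [102], and is therefore referred to here as negative free energy (NFE), $\mathcal{F}[q(\Theta)]$:

$$\mathcal{F}\left[q(\Theta)\right] = \int \log\left[\frac{p(X, \Theta)}{q(\Theta)}\right] q(\Theta)\,d\Theta \tag{4.11}$$

The second term of (4.10) is the Kullback-Leibler divergence (KLD) between the variational posterior, $q(\Theta)$, and the true posterior, $p(\Theta \mid X)$. The KLD is a measure of dissimilarity between two probability densities (it is not symmetric, so it is not a true distance metric), and it is always nonnegative. Therefore, rearranging (4.10) illustrates that the NFE is a true lower bound on the log-evidence [99]:

$$\mathcal{F}\left[q(\Theta)\right] = \log\left[p(X)\right] - \mathrm{KLD}\left[q(\Theta)\,\|\,p(\Theta \mid X)\right] \tag{4.12}$$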

The NFE thereby serves as the objective function for the VB optimization problem, since maximizing $\mathcal{F}[q(\Theta)]$ tightens the lower bound on the log-evidence. However, the true posterior cannot be calculated explicitly (hence the purpose of VB). A calculable definition of NFE can be obtained by rewriting the numerator of the fraction term in (4.11) as $p(X \mid \Theta)\,p(\Theta)$. This yields an expression of the NFE as the difference between the expected log-likelihood (with respect to the variational posterior) and the KLD between the variational posterior and the prior:

$$\mathcal{F}\left[q(\Theta)\right] = E_{q(\Theta)}\left[\log p(X \mid \Theta)\right] - \mathrm{KLD}\left[q(\Theta)\,\|\,p(\Theta)\right] \tag{4.13}$$
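The lower-bound property can be checked numerically on a model where the evidence is available in closed form. The sketch below is an illustration under stated assumptions: it uses a Beta-Bernoulli model with made-up counts and hyperparameters, restricts the variational density to a Beta(a, b), and evaluates the NFE integral with a simple midpoint rule. None of the function names come from the dissertation.

```python
import math

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_beta_pdf(theta, a, b):
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_beta_fn(a, b)

def log_binom(n, m):
    return math.lgamma(n + m + 1) - math.lgamma(n + 1) - math.lgamma(m + 1)

def log_joint(theta, n, m, a0, b0):
    """log p(X, theta): binomial likelihood times Beta(a0, b0) prior."""
    log_lik = log_binom(n, m) + n * math.log(theta) + m * math.log(1 - theta)
    return log_lik + log_beta_pdf(theta, a0, b0)

def negative_free_energy(a, b, n, m, a0, b0, grid=20000):
    """Midpoint-rule estimate of the integral of q * log(p(X, theta) / q) for q = Beta(a, b)."""
    total = 0.0
    for k in range(grid):
        theta = (k + 0.5) / grid
        log_q = log_beta_pdf(theta, a, b)
        total += math.exp(log_q) * (log_joint(theta, n, m, a0, b0) - log_q)
    return total / grid

def log_evidence(n, m, a0, b0):
    """Exact log p(X) for the conjugate Beta-Bernoulli model."""
    return log_binom(n, m) + log_beta_fn(n + a0, m + b0) - log_beta_fn(a0, b0)

n, m, a0, b0 = 7, 3, 2.0, 2.0                                   # toy counts and prior
log_px = log_evidence(n, m, a0, b0)
nfe_at_prior = negative_free_energy(a0, b0, n, m, a0, b0)        # q equal to the prior
nfe_at_post = negative_free_energy(a0 + n, b0 + m, n, m, a0, b0)  # q equal to the true posterior
```

With `q` set to the exact posterior the KLD term vanishes and the NFE matches the log-evidence; any other `q` gives a strictly smaller value.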

Maximizing the NFE is generally a high-dimensional optimization problem, due to the dimensionality of $X$ and the number of free parameters. To facilitate the optimization procedure, $q(\Theta)$ is generally restricted to a factorized density given by

$$q(\Theta) = \prod_{i} q_i(\Theta_i). \tag{4.14}$$

This approach is referred to as the mean-field approximation [98]. The factorized density given by (4.14) restricts inference by implicitly assuming that $\Theta$ can be partitioned into disjoint groups of statistically independent parameters (indexed by $i$). Note that there is no restriction on the functional forms of the individual $q_i(\Theta_i)$. The factorized density allows for the optimization of the NFE with respect to one parameter at a time. This can be illustrated by first substituting (4.14) into (4.11). To further consolidate notation, let $q_i = q_i(\Theta_i)$ and $q_{-i} = \prod_{j \neq i} q_j$. The following derivation is based on that found in [71]:


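$$\begin{aligned}
\mathcal{F}(q) &= \int q(\Theta) \log\left[\frac{p(X, \Theta)}{q(\Theta)}\right] d\Theta \\
&= \int \prod_{k} q_k \left[\log p(X, \Theta) - \sum_{k} \log q_k\right] d\Theta \\
&= \int \prod_{k} q_k \log p(X, \Theta)\,d\Theta - \int \prod_{k} q_k \sum_{k} \log q_k\,d\Theta \\
&= \int q_i \left[\int q_{-i} \log p(X, \Theta)\,d\Theta_{-i}\right] d\Theta_i - \int q_{-i} \left[\int q_i \log q_i\,d\Theta_i\right] d\Theta_{-i} \\
&\quad - \int q_i \left[\int q_{-i} \log q_{-i}\,d\Theta_{-i}\right] d\Theta_i
\end{aligned} \tag{4.15}$$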


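The second and third terms of (4.15) can be consolidated by noting that $\int q_i\,d\Theta_i = \int q_{-i}\,d\Theta_{-i} = 1$, since each factor of $q$ must be a valid PDF. This yields

$$\begin{aligned}
\mathcal{F}(q) &= \int q_i \left[\int q_{-i} \log p(X, \Theta)\,d\Theta_{-i}\right] d\Theta_i - \int q_i \log q_i\,d\Theta_i - \int q_{-i} \log q_{-i}\,d\Theta_{-i} \\
&= \int q_i \left[\int q_{-i} \log p(X, \Theta)\,d\Theta_{-i}\right] d\Theta_i - \int q_i \log q_i\,d\Theta_i - H\left[q_{-i}\right] \\
&= \int q_i\,E_{q_{-i}}\left[\log p(X, \Theta)\right] d\Theta_i - \int q_i \log q_i\,d\Theta_i - H\left[q_{-i}\right] \\
&= \int q_i \log \tilde{p}(X, \Theta)\,d\Theta_i - \int q_i \log q_i\,d\Theta_i,
\end{aligned} \tag{4.16}$$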


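where $H[q_{-i}] = \int q_{-i} \log q_{-i}\,d\Theta_{-i}$ denotes the entropy term of $q_{-i}$ (with this sign convention, the negative of its entropy), which does not depend on $q_i$. Here, $E_{q_{-i}}[\log p(X, \Theta)]$ is defined as the expected joint log-density of $X$ and $\Theta$, with respect to the variational density $q_{-i}$. The new density, $\tilde{p}(X, \Theta)$, is given by

$$\log \tilde{p}(X, \Theta) = E_{q_{-i}}\left[\log p(X, \Theta)\right] - H\left[q_{-i}\right] \tag{4.17}$$

$$\tilde{p}(X, \Theta) = \frac{\exp\left\{E_{q_{-i}}\left[\log p(X, \Theta)\right]\right\}}{\exp\left\{H\left[q_{-i}\right]\right\}} \tag{4.18}$$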

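One may recognize (4.16) as the negative KLD between $q_i$ and $\tilde{p}(X, \Theta)$. Therefore, $\mathcal{F}(q)$ will be maximized when $\mathrm{KLD}[q_i\,\|\,\tilde{p}(X, \Theta)]$ is minimized, which occurs when the two densities are equal. The variational density of parameter $\Theta_i$ that maximizes the NFE is then given by

$$\log q_i = E_{q_{-i}}\left[\log p(X, \Theta)\right] - H\left[q_{-i}\right] \tag{4.19}$$

$$q_i = \frac{\exp\left\{E_{q_{-i}}\left[\log p(X, \Theta)\right]\right\}}{\exp\left\{H\left[q_{-i}\right]\right\}} \tag{4.20}$$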

An interesting observation is that (4.20) is very similar to Bayes’ theorem, except that it involves expectations of $\log p(X, \Theta)$. By using (4.19) and (4.20), an iterative process for optimizing the NFE follows. In a process very similar to the EM algorithm, the variational posterior of each parameter is updated by using expectations computed with respect to the other parameters. Since the bound on the log-evidence is convex with respect to each factor $q_i$, each update is guaranteed not to decrease the NFE [98]. When all of the parameters are updated, the NFE may be re-evaluated and updates continue until it converges, defined as increasing by less than a predetermined amount.
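The iterative scheme can be made concrete with a textbook mean-field example that is not one of the models in this chapter: a univariate Gaussian with unknown mean and precision, a Gaussian prior on the mean (scaled by the precision), and a Gamma prior on the precision. The sketch below alternates the two factor updates. As a simplifying assumption, convergence is declared when the expected precision stops changing rather than by re-evaluating the NFE, and all names and hyperparameter defaults are illustrative.

```python
def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, tol=1e-10, max_iter=100):
    """Mean-field VB for x_n ~ N(mu, 1/tau) with
    mu | tau ~ N(mu0, 1/(lam0 * tau)) and tau ~ Gamma(a0, rate=b0).

    Returns the parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, rate=b_n).
    """
    n = len(x)
    xbar = sum(x) / n
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)   # does not depend on q(tau)
    a_n = a0 + (n + 1) / 2.0                      # does not depend on q(mu)
    e_tau = a0 / b0                               # initial guess for E[tau]
    b_n = b0
    for _ in range(max_iter):
        # Update q(mu) using the current expectation of tau.
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) using expectations under q(mu); Var[mu] = 1 / lam_n.
        e_data = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        e_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_data + e_prior)
        new_e_tau = a_n / b_n
        converged = abs(new_e_tau - e_tau) < tol
        e_tau = new_e_tau
        if converged:
            break
    return mu_n, (lam0 + n) * e_tau, a_n, b_n

data = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 1.7, 2.3]   # toy observations
mu_n, lam_n, a_n, b_n = cavi_gaussian(data)
```

Each pass updates one factor while holding the expectations of the other fixed, mirroring the alternating structure of the EM algorithm.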


4.2  Dirichlet  Process 


The Dirichlet process (DP) is a common choice of prior density for nonparametric mixture models that facilitates learning of latent variables [103]. The DP has been described as a distribution over distributions that is governed by the scaling parameter $\alpha$ and a base distribution $G_0$:

$$G \sim \mathcal{DP}(G_0, \alpha) \tag{4.21}$$

Random draws ($\theta_n$, $n = 1, 2, \ldots$) from $G$ exhibit clustering properties described by a Polya urn scheme [104], which implies that some of the $\theta_n$ will have identical values represented by $\theta_m^*$, $m = 1, 2, \ldots$. This process is typically referred to as a Chinese restaurant process, which is described as follows: Customer $x_n$ walks into a restaurant in which there are an infinite number of tables, and the customer must choose a table denoted as $\theta_n$. The probability that a customer chooses to sit at a particular table is as follows:

$$p(\theta_n \mid \theta_1, \theta_2, \ldots, \theta_{n-1}) = \begin{cases} \theta_m^* & \text{with probability } \dfrac{n_{m,n-1}(\theta_m^*)}{n - 1 + \alpha} \\[2ex] \text{New draw from } G_0 & \text{with probability } \dfrac{\alpha}{n - 1 + \alpha} \end{cases} \tag{4.22}$$


In (4.22), $n_{m,n-1}(\theta_m^*)$ denotes the number of people who are already sitting at table $\theta_m^*$, and $\alpha$ is the DP scaling parameter. The process described by (4.22) suggests that as the restaurant fills up with people, new customers will be more likely to select tables at which a large number of people are sitting. However, there is a probability (which is a function of $\alpha$) that a new customer will select an unoccupied table. From the Chinese restaurant process, $G$ can be described as a discrete probability density that assigns mass to an infinite number of atoms,

$$G = \sum_{m=1}^{\infty} \pi_m\,\delta_{\theta_m^*}, \tag{4.23}$$


where the atoms are delta functions located at each $\theta_m^*$. The mixing proportions given by $\pi_m$ can be estimated by sampling from $G$ and calculating the proportions of customers seated at each table.
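The seating rule in (4.22) is straightforward to simulate. The following is a minimal sketch using only the standard library; the function name, the fixed seed, and the choice of 500 customers are illustrative assumptions.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table assignments under a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []          # counts[m] = customers currently seated at table m
    assignments = []
    for _ in range(n_customers):
        # Occupied table m has weight counts[m]; a new table has weight alpha.
        # With n - 1 customers already seated, the weights sum to n - 1 + alpha.
        weights = counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):
            counts.append(1)             # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = chinese_restaurant_process(500, alpha=1.0)
proportions = [c / 500 for c in counts]  # empirical estimate of the mixing proportions
```

Larger values of `alpha` make new tables more likely, so the number of occupied tables tends to grow with `alpha` for a fixed number of customers.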

Another construction of the DP that constrains the proportions to sum to unity is the stick-breaking process [105]. In a stick-breaking construction, the values of $\pi_m$ can be expressed as the relative proportions of an infinite number of random pieces sequentially broken off a unit-length stick:

$$\pi_m(\mathbf{v}) = v_m \prod_{j=1}^{m-1} (1 - v_j), \quad m = 1, 2, \ldots, \infty \tag{4.24}$$

$$v_m \sim \mathrm{Beta}(1, \alpha) \tag{4.25}$$

The sizes of the individual pieces, $v_m$, that are broken off the remainder of the stick are drawn from a Beta distribution controlled by the $\alpha$ parameter. Similar to the Chinese restaurant process, the stick-breaking process yields a $G$ consisting of a countably infinite set of atoms, for which the vast majority have negligible proportion:

$$G = \sum_{m=1}^{\infty} \pi_m(\mathbf{v})\,\delta_{\theta_m^*} \tag{4.26}$$

However, in this case, the values of $\pi_m$ are a function of $\mathbf{v}$ according to (4.24).
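A truncated version of the stick-breaking construction in (4.24) and (4.25) can be sketched as follows. The truncation level, seed, and function name are assumptions for illustration; the infinite construction is approximated by keeping the first `truncation` pieces and reporting the unbroken remainder.

```python
import random

def stick_breaking(alpha, truncation, seed=0):
    """Return the first `truncation` stick-breaking weights and the leftover stick length."""
    rng = random.Random(seed)
    remaining = 1.0                       # length of stick not yet broken off
    weights = []
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)   # v_m ~ Beta(1, alpha)
        weights.append(v * remaining)     # pi_m = v_m * prod_{j<m} (1 - v_j)
        remaining *= 1.0 - v
    return weights, remaining

weights, leftover = stick_breaking(alpha=2.0, truncation=50)
```

The weights plus the leftover always sum to one, and for moderate `alpha` most of the mass falls on the first few pieces.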

The DP has been shown to be useful as a prior density in nonparametric mixture models, and the stick-breaking process is particularly amenable to variational learning since it can be incorporated into a fully-conjugate graphical model [67,69,100,101]. Nonparametric models differ from parametric models not in that they have no parameters, but in that the number of unique parameters (i.e., the effective model order) controls model complexity rather than just the shape of the PDF. A DP prior is a useful mechanism for regulating the number of parameters, thereby effectively determining the model order and avoiding overfitting. These approaches are generally referred to as Dirichlet process mixtures, in which $G$ is a conjugate prior density for an infinite number of mixture components. Therefore, the unique draws $\theta_m^*$ are the parameters that govern the $m$th component, for $m = 1, 2, \ldots, \infty$.

In the following sections, two types of DP mixtures are presented to automate learning of an unsupervised context model of unknown order. First is the DP Gaussian mixture model (DPGMM), which facilitates learning the number of GMM components. The second model is the DP mixture of factor analyzers that, like the DPGMM, will automate learning of the number of clusters as well as a locally-reduced dimensionality for each cluster.

4.3  Dirichlet  Process  Gaussian  Mixture  Model 

The GMM was shown in Chapter 3 as an example of a basic unsupervised context model that clusters the contextual features $X^{(C)}$ into $M$ contexts, with each context represented by a single Gaussian mixture component. However, the behavior and overall benefit of using a fixed-order GMM depends on whether the model order (i.e., the number of contexts) was set correctly [89]. The DPGMM improves upon the fixed-order GMM by allowing the effective number of mixture components to be learned from the data by performing Bayesian inference [67]. The likelihood function of the DPGMM context model is given by

$$p\left(x^{(C)} \mid \mathbf{v}, \mu, \Lambda^{-1}\right) = \sum_{m=1}^{\infty} \pi_m(\mathbf{v})\,\mathcal{N}\left(x^{(C)} \mid \mu_m, \Lambda_m^{-1}\right), \tag{4.27}$$

where $\mu_m$ are the component means, $\Lambda_m$ are the component precision (inverse covariance) matrices, and $\pi_m(\mathbf{v})$ are the mixing proportions drawn from the stick-breaking process given by (4.24). VB inference can be performed on this model by assuming conjugate priors on all of the model parameters, as well as the hyperparameter, $\alpha$, controlling the stick-breaking process. The data-generating process for a fully-conjugate DPGMM is as follows:

1.  Draw $\alpha \sim \mathrm{Gamma}(\tau_{10}, \tau_{20})$

2.  Draw $v_m \mid \alpha \sim \mathrm{Beta}(1, \alpha)$

3.  Draw $\theta_m^* \mid G_0 \sim \mathcal{N}\left(\mu_m^* \mid \rho_0, u_0^{-1} \Lambda_m^{*-1}\right) \mathcal{W}\left(\Lambda_m^* \mid B_0, \nu_0\right)$, $m = 1, 2, \ldots$

4.  Calculate mixture proportions $\pi_m(\mathbf{v}) = v_m \prod_{j=1}^{m-1} (1 - v_j)$, $m = 1, 2, \ldots$

5.  For $n = 1, 2, \ldots, N$

    (a)  Draw indicator variable $c_n \mid \mathbf{v} \sim \mathrm{Multinomial}(\pi)$

    (b)  Draw data $\left(x_n^{(C)} \mid c_{nm} = 1\right) \sim \mathcal{N}\left(\mu_m^*, \Lambda_m^{*-1}\right)$
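The five steps above can be sketched for one-dimensional data with a truncated stick. Several simplifications are assumptions of this example rather than part of the model as stated: the truncation level `truncation`, a scalar Gamma draw standing in for the Wishart prior on each precision (with hypothetical `prec_shape` and `prec_scale` arguments), a shape/rate reading of the Gamma prior on α, and setting the final stick fraction to one so that the truncated proportions sum to unity.

```python
import math
import random

def sample_dpgmm_1d(n_points, truncation=20, tau1=1.0, tau2=1.0, rho0=0.0,
                    u0=1.0, prec_shape=2.0, prec_scale=1.0, seed=0):
    """Draw data from a truncated, one-dimensional DP Gaussian mixture."""
    rng = random.Random(seed)
    # Step 1: scaling parameter (gammavariate takes shape and scale).
    alpha = rng.gammavariate(tau1, 1.0 / tau2)
    # Step 2: stick fractions; the last one is fixed to 1 to close the truncated stick.
    v = [rng.betavariate(1.0, alpha) for _ in range(truncation - 1)] + [1.0]
    # Step 3: component precisions and means from the (scalar) base distribution.
    lam = [rng.gammavariate(prec_shape, prec_scale) for _ in range(truncation)]
    mu = [rng.gauss(rho0, 1.0 / math.sqrt(u0 * l)) for l in lam]
    # Step 4: mixing proportions from the stick fractions.
    pi, remaining = [], 1.0
    for vm in v:
        pi.append(vm * remaining)
        remaining *= 1.0 - vm
    # Step 5: indicator, then observation, for each data point.
    labels = rng.choices(range(truncation), weights=pi, k=n_points)
    x = [rng.gauss(mu[c], 1.0 / math.sqrt(lam[c])) for c in labels]
    return x, labels, pi

x, labels, pi = sample_dpgmm_1d(300)
n_used = len(set(labels))   # number of components that actually generated data
```

Even with 20 available components, a draw like this typically places its points in only a handful of them, which is the sparsity that the DP prior promotes.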

In practice, the DPGMM is initialized with $T$ clusters, where $T$ is an arbitrarily large number, using the $k$-means algorithm. In the experiments presented in this chapter, the following hyperparameter settings were used: $u_0 = 1$, $\tau_{10} = \tau_{20} = 1$, $\nu_0 = D^{(C)}$, $B_0 = D^{(C)} I_{D^{(C)}}$, and $\rho_0$ was set equal to the sample mean of $X^{(C)}$. These hyperparameter settings were not optimized for any particular problem, as the DPGMM did not appear to be very sensitive to their settings for the problems that were considered. Details on variational inference for the DPGMM, including derivations of all posterior update equations and the negative free energy, are included in Appendix C.

The stick-breaking prior imposes a clustering effect on the parameters of each cluster that consolidates them to a few unique values. For the purpose of context-dependent learning, a pruning criterion was imposed to ensure that all contexts contained enough points for performing classification. Therefore, all clusters accounting for less than 1% of points were pruned from the model to yield $M$ clusters such that $M \ll T$. The following variational posteriors were obtained for the model parameters:

$$q(\mu_m, \Lambda_m) = \mathcal{N}\left(\mu_m \mid \rho_m, u_m^{-1} \Lambda_m^{-1}\right) \mathcal{W}\left(\Lambda_m \mid \nu_m, B_m\right), \quad m = 1, 2, \ldots, T \tag{4.28}$$

$$q(c_n) = \mathrm{Multinomial}(\phi_n) \tag{4.29}$$
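The 1% pruning criterion described before (4.28) might be implemented along the following lines. This is a sketch under assumptions: cluster sizes are counted from hard MAP assignments of the membership probabilities, surviving memberships are renormalized, and the function name and toy data are invented for the example.

```python
def prune_clusters(memberships, min_fraction=0.01):
    """Drop clusters that account for fewer than `min_fraction` of the points.

    memberships : list of N rows, each a list of T membership probabilities.
    Returns the indices of the kept clusters and the renormalized memberships.
    """
    n_points = len(memberships)
    n_clusters = len(memberships[0])
    counts = [0] * n_clusters
    for row in memberships:
        counts[max(range(n_clusters), key=row.__getitem__)] += 1   # MAP assignment
    keep = [m for m in range(n_clusters) if counts[m] / n_points >= min_fraction]
    pruned = []
    for row in memberships:
        kept = [row[m] for m in keep]
        total = sum(kept)
        if total > 0.0:
            pruned.append([p / total for p in kept])
        else:
            pruned.append([1.0 / len(keep)] * len(keep))   # no mass left: fall back to uniform
    return keep, pruned

# Toy membership matrix: 200 points, T = 3; cluster 2 wins only one point (0.5%).
toy = [[0.9, 0.1, 0.0]] * 120 + [[0.05, 0.95, 0.0]] * 79 + [[0.1, 0.1, 0.8]]
keep, pruned = prune_clusters(toy)
```

In this toy case the third cluster is removed and the single point that favored it is shared equally between the two survivors.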

Figure 4.1: Example of the DPGMM learned on a mixture of 9 Gaussian distributions. The top row illustrates the predictive density, and the bottom row illustrates the component membership matrix, at learning iterations 1-9.

For new (test) values of $x^{(C)}$, context posteriors are obtained by integrating out the model parameters to yield an a posteriori mixture of Student’s $t$-distributions given by

$$p\left(c_{nm} = 1 \mid x_n^{(C)}\right) = \frac{\pi_m\,\mathrm{St}\left(x_n^{(C)} \mid \rho_m, W_m, \tilde{\nu}_m\right)}{\sum_{j} \pi_j\,\mathrm{St}\left(x_n^{(C)} \mid \rho_j, W_j, \tilde{\nu}_j\right)}, \tag{4.30}$$

where context $m$ is represented by a Student’s $t$-distribution with $\tilde{\nu}_m = \nu_m + 1 - D^{(C)}$ degrees of freedom, mean $\rho_m$, and covariance $W_m = \left[(u_m + 1) / (u_m \tilde{\nu}_m)\right] B_m^{-1}$ [91].
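Evaluating context posteriors from a mixture of Student’s t-distributions can be sketched in one dimension as follows. The univariate density, the `(weight, mean, scale, dof)` tuple layout, and the toy components are assumptions for illustration; the multivariate case used for real contextual features would replace the scalar scale with a matrix.

```python
import math

def log_student_t(x, mean, scale, dof):
    """Log-density of a univariate Student's t with location `mean` and squared scale `scale`."""
    z = (x - mean) ** 2 / (dof * scale)
    return (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
            - 0.5 * math.log(dof * math.pi * scale)
            - 0.5 * (dof + 1) * math.log1p(z))

def context_posteriors(x, components):
    """components: list of (weight, mean, scale, dof) tuples, one per context."""
    log_terms = [math.log(w) + log_student_t(x, mu, s, nu)
                 for w, mu, s, nu in components]
    peak = max(log_terms)                          # log-sum-exp for numerical stability
    unnorm = [math.exp(t - peak) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

contexts = [(0.5, -3.0, 1.0, 5.0), (0.5, 3.0, 1.0, 5.0)]   # two toy contexts
at_midpoint = context_posteriors(0.0, contexts)   # equidistant point: even split
near_second = context_posteriors(3.0, contexts)   # point at the second context's mean
```

Working in the log domain avoids underflow when a test point lies far in the tails of every component.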

Figure  4.1  illustrates  an  example  of  the  DPGMM  being  trained  on  a  mixture 
of  9  bivariate  Gaussian  distributions  arranged  in  a  diamond  shape.  The  top  row 
illustrates  the  predictive  density  obtained  by  integrating  over  the  model  parameters 
at  VB  iterations  1,  3,  5,  7,  and  9.  The  bottom  row  illustrates  the  membership 
matrix  for  each  of  300  training  points.  Variational  inference  was  initialized  with 
T  =  20  mixture  components,  so  the  membership  matrix  is  initially  distributed  evenly 
between  the  20  columns.  As  the  number  of  iterations  increases,  the  memberships 
consolidate  to  9  columns.  Furthermore,  the  predictive  density  converges  to  a  mixture 
of  the  9  true  densities  from  which  the  training  data  was  drawn. 

In many model parameter estimation problems, it is difficult to perform inference
reliably with high-dimensional data. This is referred to as the curse of dimensionality
in most statistics texts [70-72], and the problem manifests itself in the DPGMM.
Each covariance matrix requires the estimation of $D^{(C)}(D^{(C)} + 1)/2$ unique parameters,
where $D^{(C)}$ is the dimensionality of the context feature space, and this could be
very expensive computationally if $D^{(C)}$ is large. Furthermore, the number of samples
$N$ must be much greater than $D^{(C)}$ in order to avoid over-fitting, which becomes
more difficult to achieve if $D^{(C)}$ is large. Therefore, the DPGMM was trained on the
3-D PCA projection of the contextual features for this work.
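
To make the parameter count concrete, the short sketch below evaluates D(D+1)/2 for the full 23-dimensional context feature space and for the 3-D projection, and multiplies by the T = 30 initial components used in Section 4.5.1. The helper name `covariance_parameters` is hypothetical.

```python
def covariance_parameters(dim):
    """Number of unique entries in a symmetric dim x dim covariance matrix."""
    return dim * (dim + 1) // 2


full_dim, reduced_dim, n_components = 23, 3, 30

per_component_full = covariance_parameters(full_dim)
per_component_reduced = covariance_parameters(reduced_dim)
total_full = n_components * per_component_full
total_reduced = n_components * per_component_reduced

print(per_component_full, per_component_reduced)   # 276 6
print(total_full, total_reduced)                   # 8280 180
```

The projection therefore reduces the covariance parameters per component from 276 to 6, which is the practical motivation for training the DPGMM in the reduced space.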

It  is  possible  that  the  various  contexts  over  which  data  was  collected  may  be  char¬ 
acterized  by  different  contextual  factors,  suggesting  that  a  unique  number  of  features 
might  characterize  each  context.  Therefore,  another  nonparametric  context  model  is 
proposed in the following section for learning a low-dimensional projection of each
cluster. This model, the Dirichlet process mixture of factor analyzers (DPMFA), can
potentially  avoid  the  curse  of  dimensionality  without  having  to  specify  the  number 
of  latent  feature  dimensions. 

4.4  Dirichlet  Process  Mixture  of  Factor  Analyzers 

Recall  Chapter  2  in  which  a  variety  of  contextual  features  were  proposed  for  charac¬ 
terizing  multiple  environmental  factors  from  time-domain  GPR  data.  It  is  possible 
that  different  environmental  factors,  and  therefore  features,  may  characterize  the 
various  contexts  over  which  data  was  collected.  For  example,  distinguishing  between 
a  dirt  road  and  a  concrete  road  may  only  need  to  be  based  on  one  factor,  the  soil 
dielectric  constant.  However,  some  concrete  roads  may  be  reinforced  with  rebar; 
therefore,  subsurface  heterogeneity  may  need  to  also  be  considered  for  distinguishing 
different  types  of  concrete  roads  from  a  dirt  road.  The  DPGMM  assumes  that  all 
of  the  learned  contexts  have  the  same  dimensionality,  and  therefore  utilize  the  same 
contextual information. In contrast, it could be beneficial to use a context
model that facilitates learning not only the number of contexts, but also the local
dimensionality of each context.

One technique for dimensionality reduction that has been given a Bayesian treatment,
and therefore is easily implemented in the proposed context-dependent learning
framework, is factor analysis [71,98]. Closely related to PCA, a factor analysis model
expresses the data $x$ as a projection of $K$ $D$-dimensional latent factors, $A$, onto the
$K \times 1$ scores, $s$, biased by the $D \times 1$ mean, $\mu$. The projection error is assumed to be
Gaussian with covariance matrix $\psi^{-1}\mathbf{I}$, so that

$$p(x \mid A, s, \mu) = \mathcal{N}\left(x \mid A s + \mu, \psi^{-1}\mathbf{I}\right). \tag{4.31}$$

Equation (4.31) is the same distribution assumed for the projection error of PCA;
the only difference is that in factor analysis the covariance $\psi^{-1}\mathbf{I}$ is assumed to
be diagonal rather than isotropic.
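
A quick numerical check of the factor analysis model in (4.31) is sketched below with NumPy. It rests on assumptions that are not stated at this point in the text: the scores are drawn from a standard normal distribution, the noise has one precision per feature (the diagonal case), and every numerical value is hypothetical. Under these assumptions, marginalizing the scores gives a data covariance of $AA^T + \mathrm{diag}(\psi)^{-1}$, which the sample covariance should approach.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 4, 2, 200_000

A = rng.normal(size=(D, K))                 # loading matrix (hypothetical values)
mu = np.array([1.0, -2.0, 0.5, 0.0])        # D x 1 mean
psi = np.array([4.0, 2.0, 1.0, 0.5])        # per-feature noise precisions

S = rng.normal(size=(N, K))                 # scores, assumed s_n ~ N(0, I)
noise = rng.normal(size=(N, D)) / np.sqrt(psi)
X = S @ A.T + mu + noise                    # x_n = A s_n + mu + noise

# Covariance implied by the model after integrating out the scores.
model_cov = A @ A.T + np.diag(1.0 / psi)
sample_cov = np.cov(X, rowvar=False)
max_abs_error = np.abs(sample_cov - model_cov).max()
```

With only K = 2 factors, the low-rank term captures all correlation between features, while the diagonal term absorbs feature-specific noise.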

A  mixture  of  factor  analyzers  (MFA)  is  similar  to  a  GMM,  except  that  each 
component  is  described  by  a  local  variant  of  (4.31): 

$$p(x \mid A, s, \mu) = \sum_{m=1}^{M} \pi_m \mathcal{N}\left(x \mid A_m s + \mu_m, \psi_m^{-1}\mathbf{I}\right) \tag{4.32}$$

A VB inference approach to the MFA was proposed in [68], and like the original
VBGMM [91] assumed a Dirichlet prior on the mixture proportions. Furthermore,
the MFA assigns an independent loading matrix $A_m$ to each mixture component, for
which learning may be difficult on a small data set or if outliers are present.

A more feasible approach is to impose a binary-coded vector $z_m$ on each mixture component
to select vectors from a common loading matrix [94]:

$$p(x \mid A, s, \mu) = \sum_{m=1}^{M} \pi_m \mathcal{N}\left(x \mid A\,\mathrm{diag}(z_m)\, s + \mu_m, \psi_m^{-1}\mathbf{I}\right) \tag{4.33}$$

Although performing Bayesian inference on the MFA will yield a full posterior density
for each of the $M$ mixture components, the effective number of mixture components
could only be found by using a posteriori point estimates of $\pi$ and applying a threshold.
Instead, a stick-breaking prior may be assumed for the mixing proportions $\pi$,
yielding a Dirichlet process mixture of factor analyzers (DPMFA):

$$p(x) = \sum_{m=1}^{M} \pi_m(v)\, \mathcal{N}\left(x \mid A\,\mathrm{diag}(z_m)\, s + \mu_m, \psi_m^{-1}\mathbf{I}\right) \tag{4.34}$$

Originally proposed as part of a graphical model for classifying missing data [94], the
DPMFA is used in this work for generative context modeling in the feature space
$X^{(C)}$. The stick-breaking prior on $\pi(v)$, given by (4.24), will impose a pruning effect
on extraneous mixture components by forcing the corresponding elements of $\pi$
to zero. A Bernoulli prior is also placed on the elements of $z_m$ to automate factor
selection for each local cluster. The data-generating process for a fully-conjugate
DPMFA context model is as follows:

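1. Draw $\alpha \sim \mathrm{Gamma}(\tau_{10}, \tau_{20})$

2. Draw $v_m \mid \alpha \sim \mathrm{Beta}(1, \alpha)$, $m = 1, 2, \ldots$

3. Calculate mixture proportions $\pi_m(v) = v_m \prod_{l=1}^{m-1} (1 - v_l)$, $m = 1, 2, \ldots$

4. Draw $\gamma_{dk} \sim \mathrm{Gamma}(e_0, f_0)$, $d = 1, 2, \ldots, D^{(C)}$, $k = 1, 2, \ldots, K$

5. Draw $A_{dk} \sim \mathcal{N}(A_{dk} \mid 0, \gamma_{dk}^{-1})$, $d = 1, 2, \ldots, D^{(C)}$, $k = 1, 2, \ldots, K$

6. Draw $\zeta_{mk} \sim \mathrm{Beta}(a_0/K, b_0(K - 1)/K)$, $k = 1, 2, \ldots, K$, $m = 1, 2, \ldots$

7. Draw $z_{mk} \sim \mathrm{Bernoulli}(z_{mk} \mid \zeta_{mk})$, $k = 1, 2, \ldots, K$, $m = 1, 2, \ldots$

8. Draw $\psi_{mj} \sim \mathrm{Gamma}(\psi_{mj} \mid g_0, h_0)$, $j = 1, 2, \ldots, D^{(C)}$, $m = 1, 2, \ldots$

9. Draw $\mu_m \sim \mathcal{N}\left(\mu_m \mid \mu_0, u_0^{-1}\mathrm{diag}(\psi_m^{-1})\right)$, $m = 1, 2, \ldots$

10. Draw $\delta \sim \mathrm{Gamma}(\delta_{10}, \delta_{20})$

11. For $n = 1, 2, \ldots, N$

(a) Draw $s_n \sim \mathcal{N}(s_n \mid 0, \delta^{-1}\mathbf{I})$

(b) Draw indicator variable $c_n \mid v \sim \mathrm{Multinomial}(\pi(v))$

(c) Draw data $\left(x_n^{(C)} \mid c_{nm} = 1\right) \sim \mathcal{N}\left(x_n^{(C)} \mid A\,\mathrm{diag}(z_m)\, s_n + \mu_m, \psi_m^{-1}\mathbf{I}\right)$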
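
Steps 1-3 of the process above can be sketched with the Python standard library as follows. This is an illustration under stated assumptions rather than the code used in this work: the infinite sequence is truncated at T = 20 components by setting the final stick fraction to one, the second Gamma hyperparameter is treated as a rate, and the function name `stick_breaking_weights` is hypothetical. The values $\tau_{10} = \tau_{20} = 1$ are the settings reported for the experiments in this chapter.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: pi_m = v_m * prod_{l<m} (1 - v_l)."""
    weights = []
    remaining = 1.0                      # length of the stick not yet assigned
    for m in range(truncation):
        # Setting the last fraction to 1 makes the truncated weights sum to one.
        v = 1.0 if m == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
tau_10, tau_20 = 1.0, 1.0
# random.gammavariate takes (shape, scale); tau_20 is assumed to be a rate.
alpha = rng.gammavariate(tau_10, 1.0 / tau_20)
pi = stick_breaking_weights(alpha, truncation=20, rng=rng)
```

Small values of the concentration parameter place most of the stick on the first few components, which is the mechanism behind the pruning effect on extraneous mixture components.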

The DPMFA model was initialized with $T = 20$ clusters, using the k-means
algorithm. Furthermore, the number of factors was capped at $K = 10$. In the
experiments presented in this chapter, the following hyperparameter settings were
used: $a_0 = 1$, $b_0 = 0.5$, $e_0 = g_0 = \delta_{10} = 0.1$, $f_0 = h_0 = \delta_{20} = 10$, and $\tau_{10} = \tau_{20} = 1$.
These  hyperparameter  settings  were  chosen  to  limit  sparseness  in  the  factor  loadings 
and  scores,  and  allow  selection  to  be  governed  by  inference  of  z.  However,  the 
values  were  not  specifically  optimized  for  any  particular  problem  and  were  used  for 
experiments  with  synthetic  as  well  as  real  data.  Variational  inference  yields  the 
following  variational  posteriors  on  the  model  parameters: 


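$$q(A_{dk}) = \mathcal{N}(A_{dk} \mid \omega_{dk}, \Omega_{dk}) \tag{4.35}$$

$$q(s_n) = \mathcal{N}_K(\xi_n, \Lambda_n) \tag{4.36}$$

$$q(z_{mk}) = \mathrm{Bernoulli}(\eta_{mk}) \tag{4.37}$$

$$q(\mu_m) = \mathcal{N}_{D^{(C)}}(\rho_m, U_m) \tag{4.38}$$

$$q(\psi_{tj}) = \mathrm{Gamma}(g_{tj}, h_{tj}) \tag{4.39}$$

$$q(c_n) = \mathrm{Multinomial}(\phi_n) \tag{4.40}$$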


In practice, the variational expectations of the factor loadings ($\omega_{dk}$), scores ($\xi_n$),
selectors ($\eta_{mk}$), as well as the mixture component means ($\rho_m$), variances ($\psi_{tj}$), and
memberships ($\phi_n$) were used as the learned model parameters. Additionally, a pruning
criterion was imposed on the mixture components. All components accounting
for less than 1% of points were pruned from the model to yield $M$ clusters such that
$M \ll T$.

Figure 4.2 presents an example of a factor analysis model to highlight the behavior
and performance of the DPMFA model. In this example, data was generated
from a known factor loading matrix, mixture component means, and selection vectors
while using random scores. The factor loading matrix was specified by having
pairs  of  features  share  a  single  factor  loading.  These  shared  elements  of  A  were  set 
to  one  while  the  remaining  elements  were  set  to  zero.  The  factor  scores,  S,  were 
randomly  drawn  from  a  zero-mean,  unit-variance  Gaussian  distribution.  The  data 
was  partitioned  into  three  clusters,  and  unique  factor  selection  vectors  were  specified 
for  each.  The  first  cluster  (samples  1-500)  utilized  three  factors,  the  second  cluster 
(501-1000)  utilized  two  factors  that  were  distinct  from  those  in  the  first  cluster,  and 
the  third  cluster  (1001-1500)  also  utilized  two  factors,  each  shared  with  one  of  the 
two  other  clusters.  Furthermore,  the  first  cluster  was  biased  by  a  mean  value  of  5, 
the  second  cluster  was  biased  by  a  mean  value  of  -5,  and  the  third  cluster  remained 
zero-mean.  White  noise  with  a  variance  of  0.5  was  then  added  to  the  data. 
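
A data set of this kind could be generated along the following lines. The sketch uses NumPy and fills in details that the description leaves open, so these are assumptions: five factors and ten features (two features per factor), a particular layout of the selection vectors that satisfies the stated sharing pattern, and a cluster mean that is added to every feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_cluster, K = 500, 5
D = 2 * K                                    # assumption: two features per factor

# Loading matrix: features 2k and 2k+1 share factor k (ones), all else zero.
A = np.zeros((D, K))
for k in range(K):
    A[2 * k:2 * k + 2, k] = 1.0

# Selection vectors (assumed layout): cluster 1 uses three factors, cluster 2
# uses two other factors, and cluster 3 shares one factor with each of them.
Z = np.array([[1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)
means = np.array([5.0, -5.0, 0.0])           # per-cluster bias
noise_var = 0.5

S = rng.normal(size=(3 * n_per_cluster, K))  # zero-mean, unit-variance scores
labels = np.repeat(np.arange(3), n_per_cluster)

# x_n = A diag(z_m) s_n + mu_m, followed by additive white noise.
clean = (S * Z[labels]) @ A.T + means[labels, None]
X = clean + rng.normal(scale=np.sqrt(noise_var), size=clean.shape)
```

Features whose factor is switched off in a cluster contain only the cluster mean plus noise, so their variance stays near 0.5, while active features have variance near 1.5.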

In this example, the model converged to a solution within 11 VB iterations, and
yielded the expectations of the model parameters shown in Figure 4.3. Although the
learned loading matrix does not match the structure of the true $A$ from Figure 4.2,
the factors are sparse and shared by two features at a time. This discrete selection of
factors is summarized by the bottom-left image, which illustrates the learned factor
selection vectors. Finally, the clustering results are summarized by the expected
memberships at bottom-right, which are illustrated by the probabilities $\Phi$. Clearly,
three distinct clusters have been learned. The means show that the correct cluster
locations were learned (having means of 5, -5, and 0), and the variances are close to

Figure 4.2: An example factor analysis problem to illustrate DPMFA model performance.
Top-left: true factor loading matrix (A); Top-right: factor scores (S);
Bottom-left: factor selectors (Z); Bottom-right: original data (X).


the  true  variance  of  0.5. 

Latent  feature  models  like  the  DPMFA  can  also  be  thought  of  as  a  technique  for 
signal/image  denoising.  By  learning  the  latent  factors  present  in  multi-dimensional 
data,  the  most  informative  parts  of  the  data  are  retained.  Considering  this  synthetic 
example,  the  data  matrix  X  can  be  thought  of  as  an  N  x  D  image,  in  which  the 
features  corresponding  to  the  shared  factors  are  the  informative  parts.  Reproducing 
X by substituting the posterior expectations of the model parameters into (4.34) yields
an  image  similar  to  the  original  data,  but  with  the  noise  removed  and  the  informative 
features  retained. 

As shown by the example, the DPMFA performs joint clustering and feature selection
in an unsupervised manner. In context modeling, this is important because
certain contexts may be explained by different environmental factors. For example,

Figure 4.3: A posteriori expected values of the DPMFA model parameters learned
from the example data shown in Figure 4.2. Top-left: learned loading matrix (A);
Top-right: learned factor scores (S); Center-left: learned factor selectors (Z);
Center-right: learned cluster memberships ($\Phi$); Bottom-left: learned component means ($\mu$);
Bottom-right: learned component variances ($\psi$).


distinguishing between GPR data collected on a homogeneous dirt lane, but in differing
moisture conditions, may only be dependent on one feature (e.g., soil dielectric
constant). However, distinguishing these contexts from paved or heavily-cluttered
soils may require additional information (e.g., subsurface heterogeneity). Given a
large number of contextual features for characterizing multiple environmental factors,
the DPMFA is useful because it automates learning of the number of contexts

Figure 4.4: Results of denoising the example data shown in Figure 4.2 with
DPMFA. Left: original data, shown in the bottom-right of Figure 4.2; Right: denoised
data, calculated from learned DPMFA model parameters.

and  also  the  local  dimensionality  of  each. 

4.5  Experimental  Results 

Preliminary  results  of  using  the  DPGMM  in  context-dependent  algorithm  fusion 
were  presented  in  [79]  using  a  smaller  data  set  considering  only  antitank  landmine 
targets.  In  this  section,  experimental  results  are  presented  for  using  the  DPGMM  and 
DPMFA  in  context-dependent  algorithm  fusion  on  the  data  set  that  was  summarized 
in  Section  3.4.  First,  the  results  of  context  learning  are  analyzed  by  comparing 
the  contexts  learned  by  the  DPGMM  and  DPMFA  to  the  known  labels.  Then,  the 
RVM  weights  learned  for  performing  context-dependent  algorithm  fusion  using  either 
DPGMM or DPMFA contexts are compared. Finally, the detection performance of
context-dependent algorithm fusion using both context models is compared to the
basic approaches originally shown in Figure 3.5.

4.5.1 Context Learning with the DPGMM

The DPGMM was trained on the 3-D PCA projection of the normalized GPR context
features. Initialization was set to $T = 30$ clusters using the k-means algorithm. The

Figure 4.5: Scatterplot comparing results of context learning using the DPGMM
on the GPR contextual features to the known soil labels. Left: Scatter plot of 3-D
PCA projection of contextual features, with points colored by qualitative soil label.
Right: Same scatter plot, but with points colored by MAP DPGMM component.


hyperparameters were set according to the same values used in the synthetic data
example, with $D^{(C)} = 3$ since the PCA-projected context features were used. Of the
30 initial clusters, the DPGMM converged to 19 within the 1% pruning threshold.
Figures 4.5 and 4.6 illustrate the performance of the DPGMM in clustering the
context  features.  The  left  panel  of  Figure  4.5  illustrates  the  scatterplot  of  the  PCA- 
projected  context  features,  with  the  points  colored  by  the  known  soil  labels.  The 
right  panel  shows  the  contexts  obtained  by  assigning  points  to  the  MAP  DPGMM 
component.  The  similarity  matrix  comparing  the  contexts  learned  by  the  DPGMM 
to  the  known  labels  is  shown  in  Figure  4.6. 
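
One way such a similarity matrix could be computed is sketched below in Python. The exact definition used for Figure 4.6 is not given in this section, so the sketch assumes a row-normalized contingency table: each row is a learned context, each column a known soil label, and each entry the fraction of the context's observations carrying that label. The function name and the toy data are hypothetical.

```python
from collections import Counter


def similarity_matrix(contexts, labels):
    """Fraction of each learned context's points that carry each known label."""
    context_ids = sorted(set(contexts))
    label_ids = sorted(set(labels))
    counts = Counter(zip(contexts, labels))
    matrix = []
    for c in context_ids:
        row_total = sum(counts[(c, l)] for l in label_ids)
        matrix.append([counts[(c, l)] / row_total for l in label_ids])
    return context_ids, label_ids, matrix


# Toy data (hypothetical): MAP context index and soil label for 8 observations.
contexts = [1, 1, 1, 2, 2, 2, 2, 3]
labels = ["dirt", "dirt", "gravel",
          "asphalt", "asphalt", "asphalt", "concrete", "gravel"]
context_ids, label_ids, matrix = similarity_matrix(contexts, labels)
```

A row dominated by a single entry corresponds to a context that is "predominantly" one soil type, while a row with two comparable entries corresponds to a context that is split between labels.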

Contexts  2,  3,  4,  6,  9,  10,  12,  13,  14,  and  16  were  predominantly  dirt.  Context  18 
was  predominantly  gravel.  Contexts  1,  5,  7  and  15  were  roughly  split  between  dirt 
and  gravel,  suggesting  a  possible  overlap  of  soil  properties  between  these  two  labeled 
categories.  Asphalt  made  up  most  of  Context  11,  and  concrete  made  up  most  of 

Figure 4.6: Similarity matrix comparing DPGMM clustering results to the known
soil labels.


Context 8. Context 17 was roughly split between asphalt and concrete. No context
overlapped  significantly  between  one  of  the  unpaved  soil  types  and  one  of  the  paved 
categories.  These  results  suggest  that  in  a  large  GPR  collection  such  as  this,  there 
may  be  a  wealth  of  contextual  information  beyond  the  scope  of  the  available  soil 
labels  that  can  be  learned  using  a  nonparametric  model. 

The learned model parameters are shown by Figure 4.7, which illustrates the
cluster means, $\rho_m$, and Figure 4.8, which illustrates the covariances $\Lambda_m$. These plots
illustrate that each context corresponds to a Student's t-distribution with unique
mean and covariance.

4.5.2 Context Learning with the DPMFA

The  DPMFA  was  trained  on  the  full  23-dimensional  GPR  context  features  using  the 
same  hyperparameter  settings  from  the  synthetic  example.  Figure  4.9  illustrates  the 
similarity  matrix  obtained  by  comparing  the  known  soil  labels  to  the  MAP  contexts 
assigned  by  DPMFA  clustering.  In  this  case,  12  contexts  were  learned  that  met  the 

Figure 4.7: Means of clusters learned by the DPGMM context model. The horizontal
axis represents the dimension of the PCA-projected features $X^{(C)}$, the vertical
axis represents the mean of each cluster that was learned, and colors represent the
individual contexts.


1% pruning threshold. The data collected over dirt and gravel were split into many
sub-contexts by the DPMFA. Most of these sub-contexts contained both dirt
and gravel data; most were predominantly dirt, but one
(Context 7) was predominantly gravel. Asphalt was split into two distinct contexts
(3 and 10), and was rarely confused with any of the other soil types. The majority
of observations in Context 2 were concrete.

Figure  4.10  illustrates  the  expected  model  parameters  that  were  learned  using  VB 
inference  on  the  DPMFA  model.  The  learned  factor  loadings  (each  vector  normalized 
to  unit-magnitude  for  illustration  purposes)  are  shown  in  the  top-left  panel,  the 
learned  scores  (scaled  by  the  corresponding  factor  magnitude)  are  shown  at  top-right, 
the  learned  selection  vectors  are  shown  at  bottom-left,  and  the  cluster  membership 
probabilities  are  shown  at  bottom-right.  The  membership  matrix  illustrates  that 
most  observations  fall  into  clusters  1,  3,  6,  7,  8,  9,  10,  11,  13,  15,  17,  and  20.  The 
factor  selection  matrix  shows  that  most  of  these  mixture  components  only  utilize  the 

Figure 4.8: Covariance matrices of clusters learned by the DPGMM context model.
Each panel represents the covariance matrix of the Student-t mixture components
obtained by integrating over the DPGMM parameters.


first factor that was learned. Clusters 1, 8, 13, 17, and 20 also utilize factor
3, although the scores assigned to factor 3 are small (as are the elements of the factor
vector itself).

An  interesting  observation  here  is  that  the  learned  factors  were  constructed  from 
projections  of  different  contextual  features.  The  features  with  the  greatest  magnitude 
in  factor  1  are  features  9-14,  which  correspond  to  the  late-time  portions  of  the  MP 
histogram.  Factor  1  may  therefore  characterize  soil  heterogeneity  and  attenuation 
properties.  Meanwhile,  the  features  with  greatest  magnitude  in  factor  3  are  features 
6  and  7,  which  correspond  to  early-time  portions  of  the  MP  histogram.  Therefore, 

Figure 4.9: Similarity matrix comparing DPMFA clustering results to the known
soil labels.

factor  3  may  characterize  the  near-surface  properties  of  the  soil.  Surprisingly,  features 
1-2  (energy  and  reflection  coefficient)  have  a  very  small  magnitude  in  both  factors. 

Figure 4.11 illustrates the expected means, $\rho_m$, of each of the 23-dimensional
clusters learned from the DPMFA. Figure 4.8 illustrates the variances (i.e., projection
residual), $\psi_m$, of each dimension within each cluster. Each context is characterized
by  a  unique  mean  and  covariance.  Several  of  the  contexts  have  means  located  near 
zero,  while  others  appear  to  be  on  the  outskirts  of  the  feature  space. 

The variances of the DPMFA-learned contexts should be considered not only
as variances in the Gaussian sense, but also as the residual of the factor analysis
projection of $X^{(C)}$. Therefore, the features corresponding to nulls in variance are
best characterized by the factors selected for that context. By this observation, each
context appears to have a unique set of nulls (although contexts 1, 9, and 11 appear
to be very similar), suggesting that each context uses different contextual feature information.
Furthermore, each context yields high variance on features 14-23, which
correspond to the LP power features. Recall from Chapter 2 that LP power decreases

Figure 4.10: A posteriori expected values of the DPMFA model parameters learned
from the GPR contextual features. Top-left: learned loading matrix (A); Top-right:
learned factor scores (S); Bottom-left: learned factor selectors (Z); Bottom-right:
learned cluster memberships ($\Phi$).


exponentially with respect to temporal index, suggesting that the feature is characteristic
of attenuation effects in soil. Therefore, the high variance that the DPMFA
yielded  for  the  LP  power  features  may  be  an  artifact  of  fitting  a  linear  model  to 
features  that  exhibit  a  nonlinear  relationship. 

4.5.3 Context-Dependent Fusion Results

Context-dependent  algorithm  fusion  was  evaluated  using  the  DPGMM  and  DPMFA 
context models. Like the basic supervised and unsupervised context learning techniques
presented in Chapter 3, posterior context probabilities obtained from the
DPGMM  and  DPMFA  were  used  in  training  a  mixture  of  RVMs  for  weighting  the 
confidences  of  the  Prescreener  [38],  EHD  [44],  SPSCF  [49],  and  HMM  [42]  algorithms. 
The  RVM  weights  obtained  for  the  DPGMM  contexts  are  plotted  in  Figure  4.13. 

Figure 4.11: Means of clusters learned by the DPMFA context model. The horizontal
axis represents the dimension of the features $X^{(C)}$, the vertical axis represents the
mean of each cluster that was learned, and colors represent the individual contexts.


The  HMM,  by  far  the  best-performing  single  algorithm  on  this  data  set,  received 
the  most  weight  and  was  never  irrelevant.  Compared  to  the  HMM,  the  other  three 
algorithms  were  assigned  very  small  weight  and  their  relative  weights  varied  with 
respect  to  context.  Each  algorithm,  with  the  exception  of  HMM,  was  irrelevant  in 
at  least  one  context. 

Figure  4.14  illustrates  the  RVM  fusion  weights  obtained  for  each  of  the  contexts 
learned  from  the  DPMFA  context  model.  In  the  DPMFA  contexts,  the  HMM  did 
not  dominate  fusion  as  much  as  it  did  with  respect  to  the  DPGMM  contexts.  In  one 
context  (context  8),  it  was  actually  irrelevant.  Meanwhile,  the  prescreener  received 
large  weight  in  several  contexts,  but  it  was  irrelevant  in  one  context  (context  11). 
Each  context  therefore  yielded  a  unique  weighting  of  the  four  on-board  algorithms, 
with  each  algorithm  being  considered  irrelevant  in  at  least  one  context. 
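
The role of the fusion weights can be illustrated with the small Python sketch below. It is a simplified stand-in for the mixture of RVMs, not the method as implemented in this work: it assumes that each context has a linear discriminant with a logistic link and that the per-context outputs are averaged using the context posterior probabilities. All weights, biases, and confidences are invented, and a zero weight plays the role of an algorithm that is irrelevant in that context.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def fused_confidence(context_posterior, weights, biases, confidences):
    """Context-gated mixture of linear discriminants with a logistic link."""
    total = 0.0
    for p_m, w_m, b_m in zip(context_posterior, weights, biases):
        activation = b_m + sum(w * x for w, x in zip(w_m, confidences))
        total += p_m * sigmoid(activation)
    return total


algorithms = ["Prescreener", "EHD", "SPSCF", "HMM"]
weights = [[0.0, 0.3, 0.1, 2.0],     # context 1: prescreener irrelevant
           [1.2, 0.0, 0.0, 0.8]]     # context 2: EHD and SPSCF irrelevant
biases = [-1.0, -0.5]
confidences = [0.2, 0.5, 0.1, 0.9]   # one confidence value per algorithm
posterior = [0.7, 0.3]               # context posterior for this observation
fused = fused_confidence(posterior, weights, biases, confidences)
```

Because the context posterior gates the discriminants, an algorithm with zero weight in the dominant context has little influence on the fused confidence, even if it is heavily weighted elsewhere.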

Figure 4.12: Covariance matrices of clusters learned by the DPGMM context
model.

4.5.4 Detection Performance

Context-dependent fusion was evaluated using the same 10-fold, object-based cross-validation
method used to generate the results shown in Chapter 3. The ROC
curves plotted in Figure 4.15 illustrate the results of context-dependent fusion using
the DPGMM and DPMFA context models, which are respectively plotted in red and
blue. Performance is compared to global RVM fusion (black dashed), which is not
context-dependent, as well as the individual algorithms (dashed lines). The plot is on
the same axes scale as Figure 3.3 for easy comparison to the basic context-dependent

Figure 4.13: RVM discriminant weights learned for algorithm fusion in each
DPGMM context. Each stem represents a particular dimension of the target feature
space, the vertical axis represents the weight value, and the individual contexts
are indicated by line color.


Figure 4.14: RVM discriminant weights learned for algorithm fusion in each
DPMFA context. Each stem represents a particular dimension of the target feature
space, the vertical axis represents the weight value, and the individual contexts
are indicated by line color.

Figure 4.15: ROC curves for context-dependent fusion, using either the DPGMM
or DPMFA context models, compared to non-context-dependent RVM fusion and
the individual fused algorithms. The ROC consists of PD versus FAR, measured in
false alarms per square meter, as a function of decision threshold.

techniques discussed in Chapter 3. Results illustrate that significant performance improvements
(i.e., outside the 90% confidence bounds indicated by the shaded region)
over  the  non-context-dependent  RVM  are  possible  by  incorporating  nonparametric 
Bayesian,  generative  context  learning. 
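
For readers unfamiliar with this form of ROC, the sketch below shows one way to compute PD versus FAR (false alarms per square meter) by sweeping the decision threshold over a list of alarm confidences. It is a generic illustration, not the scoring code used in this work; the function name, the surveyed area, and the toy alarms are hypothetical, and confidence bounds are omitted.

```python
def roc_points(confidences, is_target, area_m2):
    """Sweep the decision threshold and return (FAR, PD) pairs.

    FAR is false alarms per square meter; PD is the fraction of targets detected.
    """
    n_targets = sum(is_target)
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, detections, false_alarms = [], 0, 0
    for rank, i in enumerate(order):
        if is_target[i]:
            detections += 1
        else:
            false_alarms += 1
        # Emit a point only after the last alarm sharing this confidence value.
        is_last = rank == len(order) - 1
        if is_last or confidences[order[rank + 1]] != confidences[i]:
            points.append((false_alarms / area_m2, detections / n_targets))
    return points


# Toy example (hypothetical): five alarms collected over 100 square meters.
confidences = [0.9, 0.8, 0.7, 0.4, 0.2]
is_target = [True, False, True, True, False]
points = roc_points(confidences, is_target, area_m2=100.0)
```

A reduction in FAR at a fixed PD corresponds to the curve shifting left at that height, which is the comparison made between the context-dependent and global fusion results.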

Context-dependent  fusion  with  the  DPGMM,  which  did  not  utilize  soil  labels  and 
also  did  not  require  the  specification  of  the  number  of  contexts  to  learn,  achieved 
significant reductions in FAR at 0.92 > PD > 0.25. Context-dependent fusion using
the DPMFA context model, which used even less a priori information than the
DPGMM, also yielded significant FAR reduction for the same PD range. It should
be noted that both techniques performed better than context-dependent fusion using
the supervised contexts trained according to the known soil labels, indicating
that  additional  useful  contextual  information  can  be  exploited  using  nonparametric 
models. 

4.6  Conclusions 

In this chapter, Bayesian techniques for learning generative nonparametric context
models were presented and evaluated on the proposed GPR context features. The
two context models were the DPGMM and the DPMFA. Both techniques utilize DP
priors to facilitate learning of the number of clusters (contexts) present in the data.
The DPGMM was trained on the 3-D PCA projection of the context features, while
the DPMFA was able to learn a unique local dimensionality reduction for each cluster.
Performance analysis showed that nonparametric models can potentially exploit information
that is not described by available qualitative context labels. Experimental
results  on  field-collected  GPR  data  illustrated  that  using  generative  nonparametric 
context  models  to  aid  in  context-dependent  fusion  yields  significant  reductions  in 
FAR  for  a  wide  range  of  PD  when  compared  to  conventional  fusion. 

In contrast to the generative learning techniques that were proposed in this chapter,
the following chapter presents discriminative techniques for GPR context modeling.
In this chapter, context models were trained on the context features only,
without regard to the target features or the target/clutter labels of each observation.
Alternatively, discriminative learning would find contexts that yield the best
overall classification of targets from non-targets. Instead of considering context
identification and algorithm fusion as independent tasks, discriminative learning
would  consider  them  jointly  to  yield  contexts  that  allow  for  the  best  classification  of 
targets  and  clutter  in  each. 

5


Discriminative  Nonparametric  Context  Learning 


In  the  previous  chapter,  two  generative  approaches  to  context-dependent  learning 
were  proposed  in  which  the  context  model  and  classifiers  were  learned  independently 
of  one  another.  In  generative  context  learning,  contexts  were  learned  based  on  the 
distribution  of  the  context  features  and  not  with  regard  to  the  target/clutter  class 
labels.  In  contrast,  it  may  be  desirable  to  learn  a  context-dependent  classifier  in  a 
discriminative  manner.  Discriminative  learning  may  be  useful  in  finding  contexts 
that  allow  for  the  best  separation  of  the  target  and  clutter  classes.  1 

In  this  chapter,  two  approaches  are  proposed  for  discriminative  context  learning. 
The  first  is  a  discriminative  treatment  of  the  DPGMM  context  model  coupled  with 
RVM  classifiers.  The  second  is  a  similar  technique  from  the  literature  that  utilizes 
non-sparse  linear  classifiers  and  operates  on  the  joint  context  and  target  features. 
A  comparison  of  both  models’  behavior  is  illustrated  through  several  examples  with 
synthetic  data.  Finally,  both  techniques  were  evaluated  for  GPR  algorithm  fusion 
and  performance  was  compared  to  previous  approaches. 

1  This  chapter  is  derivative  of  previously  published  work  [21] 

5.1 Generative vs. Discriminative Learning


Statistical classification approaches often fall into one of two categories: generative
or discriminative models. A generative model describes how the given data $X$
was likely generated, and involves learning parameters $\Theta$ that define the likelihood function
$p(X \mid \Theta)$. Most density estimation techniques fall under the umbrella of generative
models, including the GMM, HMM, and the k-nearest neighbor density estimate
[70-72]. Discriminative models seek to describe how data is classified, and involve
learning parameters of the conditional PDF of the labels $t$, i.e., $p(t \mid X, \Theta)$. Most
classifiers would therefore be considered discriminative models, including Fisher's
linear discriminant [72], support vector machines (SVMs) [92], and RVMs [83,84].

In the previous chapters, generative techniques were proposed for training a context
model (e.g., GMM, DPGMM, DPMFA) without regard to the target/clutter
labels associated with each observation. Although the learned contexts may be
reflective of underlying environmental factors, they may not necessarily allow for
the best discrimination between targets and clutter. Because the ultimate goal of
context-dependent learning is to improve target discrimination across varying environments,
it is important to consider the potential benefits of discriminative context
learning. Discriminative context models can be framed as a special case of the
mixture-of-experts family of models, which are summarized in the following section.

5.2  Mixture-of-Experts  Models 

In many classification problems, a single linear model may not be sufficient for discriminating
between classes. Therefore, many nonlinear classification models have
been proposed. These include techniques such as polynomial discriminant analysis
[72], decision trees [70] and random forests [106], neural networks [70,72], and
sparse kernel machines including SVMs [92] and RVMs [83,84]. For each of these
techniques, several parameters must be “tuned” to avoid over- or under-training.
Such tuning parameters include the order of a polynomial discriminant function, the
pruning criteria used in tree-based methods, the number of hidden layers in a neural
network, or the Gram matrix used in training an SVM or RVM. Context-dependent
classification is a clear example of a problem requiring a nonlinear decision model.
However, it is important to avoid the pitfall of insufficient training due to poor
parameter selection while still maintaining the ability to discriminatively train the
classifier.

Mixture-of-experts models are a family of classification and regression techniques
that approximate a nonlinear model by a mixture of locally-linear “expert” models.
The most representative of this family of classifiers is the hierarchical mixture of
experts (HME) [107], in which the distribution of the binary class label, $t$, conditioned
on each of $m = 1, 2, \ldots, M$ experts is given by

\[ p(t \mid \mathbf{x}, \mathbf{w}_m) = \sigma(\mathbf{w}_m^T \mathbf{x})^{t} \left[1 - \sigma(\mathbf{w}_m^T \mathbf{x})\right]^{1-t}, \tag{5.1} \]

where $\mathbf{w}_m$ are the weights associated with expert $m$, and $\sigma(\cdot)$ denotes the logistic
sigmoid function.
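
As a minimal sketch of the expert likelihood in (5.1), the following evaluates the Bernoulli probability of a binary label under a single logistic expert. The function names and the use of plain Python lists for $\mathbf{x}$ and $\mathbf{w}_m$ are illustrative assumptions, not part of the original work.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def expert_likelihood(t, x, w):
    """Probability of binary label t under one logistic expert, as in (5.1)."""
    activation = sum(wi * xi for wi, xi in zip(w, x))  # w^T x
    s = sigmoid(activation)
    # sigma^t * (1 - sigma)^(1 - t) reduces to one of the two factors
    return s if t == 1 else 1.0 - s
```

Because the two label probabilities sum to one for any input, each expert defines a proper Bernoulli distribution over $t$.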

The HME utilizes a linear gating network of $p = 1, 2, \ldots, P$ nodes, each corresponding
to an associated binary variable, $z_p \in \{0, 1\}$. The value of $z_p$ is drawn from
a Bernoulli distribution given by

\[ p(z_p \mid \mathbf{x}, \mathbf{v}_p) = \sigma(\mathbf{v}_p^T \mathbf{x})^{z_p} \left[1 - \sigma(\mathbf{v}_p^T \mathbf{x})\right]^{1-z_p}, \tag{5.2} \]

where $\mathbf{v}_p$ are the parameters of the distribution governing node $p$.

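Given the state of the gating network, the conditional distribution on the labels,
$t$, takes the form

\[ p(t \mid \mathbf{x}, \mathbf{W}, \mathbf{z}) = \prod_{m=1}^{M} \left\{ \sigma(\mathbf{w}_m^T \mathbf{x})^{t} \left[1 - \sigma(\mathbf{w}_m^T \mathbf{x})\right]^{1-t} \right\}^{\zeta_m}, \tag{5.3} \]

where

\[ \zeta_m = \prod_{p=1}^{P} \tilde{z}_p. \tag{5.4} \]

The parameter $\tilde{z}_p$ allows for the nesting of sub-networks, so that

\[ \tilde{z}_p = \begin{cases} z_p & \text{if } m \text{ is in the left sub-tree of } p \\ 1 - z_p & \text{if } m \text{ is in the right sub-tree of } p \\ 1 & \text{otherwise.} \end{cases} \tag{5.5} \]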
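
To make the role of the path indicator in (5.4) and (5.5) concrete, the sketch below evaluates it for a hypothetical two-level binary gating tree with $P = 3$ nodes and $M = 4$ experts. The tree layout, the dictionary names, and the convention that a node off the path to expert $m$ contributes a factor of one are assumptions made for illustration.

```python
# Hypothetical two-level tree: node 0 is the root, node 1 its left child,
# node 2 its right child. Experts 0-1 sit under node 1, experts 2-3 under node 2.
LEFT = {0: {0, 1}, 1: {0}, 2: {2}}    # experts in the left sub-tree of each node
RIGHT = {0: {2, 3}, 1: {1}, 2: {3}}   # experts in the right sub-tree of each node

def path_indicator(m, z):
    """Product over gating nodes that equals 1 only for the selected expert."""
    zeta = 1
    for p, z_p in enumerate(z):
        if m in LEFT[p]:
            zeta *= z_p
        elif m in RIGHT[p]:
            zeta *= 1 - z_p
        # otherwise node p is not on the path to expert m and contributes 1
    return zeta

# Root and node 1 both branch left; node 2 is off the selected path.
selected = [path_indicator(m, (1, 1, 0)) for m in range(4)]
```

For any binary setting of the gating variables, exactly one of the four indicators is nonzero, so the product in (5.3) reduces to the likelihood of the single selected expert.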

The  HME  is  learned  discriminatively,  and  ML  [107]  and  Bayesian  [108]  approaches 
have  been  proposed.  However,  the  same  caveats  regarding  model  order  that  were 
discussed  for  probabilistic  mixture  models  in  Chapter  3  also  apply  to  the  HME.  The 
order  of  the  HME  model  is  given  by  P,  the  number  of  unique  nodes,  and  M ,  the 
number  of  experts.  Both  must  be  specified,  and  improper  selection  of  P  and  M  could 
lead  to  over-  or  under-training,  which  could  result  in  poor  performance. 

This  chapter  considers  two  methods  for  discriminative  context  learning  based  on 
the  HME  paradigm,  but  the  linear  gating  network  is  replaced  with  a  network  based 
on  the  DPGMM,  which  was  originally  presented  in  Chapter  4.  The  DPGMM  gating 
network  allows  for  a  nonparametric  model,  facilitating  learning  of  the  number  of 
expert  component  classifiers,  using  previously-developed  learning  methods. 

The two methods being considered for discriminative context learning differ in
the features used for classification and clustering, as well as their accommodation
of sparse classification models. The first technique is based on those proposed in
Chapter 4, and involves replacing the linear gating network of the HME with a
DPGMM, and the logistic experts with RVMs. Thus, this approach is referred to as
the DPGMM-RVM. A novel property of the DPGMM-RVM is that it seeks to learn
the DPGMM in the contextual features, while also training the RVMs on the target
features [21].

The second discriminative context model is based on the infinite quadratically-gated
mixture of experts (IQGME) [94]. The IQGME also utilizes a DPGMM gating
network, but performs classification and clustering in the same feature space. Therefore,
it is not amenable to sparse classifiers. The derivations of both the DPGMM-RVM
and IQGME are presented in greater detail in Section 5.3, and performance is
compared in a series of synthetic data examples in Section 5.4.

5.3  Discriminative  Context  Models 

Consider the DPGMM context model whose likelihood function is given by (4.27).
The stick-breaking prior is initialized with a truncation level of $T$, and the DPGMM
will cluster the contextual features into $M$ mixture components where $M < T$.
Additionally, consider the RVM classifier whose likelihood function is given by (3.7)
and (3.8). The RVM incorporates a sparseness-promoting prior on the weights ($\mathbf{w}$)
that are used to classify the target features ($X^{(T)}$) according to the labels ($\mathbf{t}$).

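Inference could be performed on the DPGMM and RVM jointly using a discriminative
model referred to here as the DPGMM-RVM. The likelihood function of the
DPGMM-RVM is given by

\[ p(\mathbf{t}, X^{(C)} \mid X^{(T)}, C, W, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \prod_{n=1}^{N} \prod_{m=1}^{T} \left\{ \sigma(\mathbf{w}_m^T \mathbf{x}_n^{(T)})^{t_n} \left[1 - \sigma(\mathbf{w}_m^T \mathbf{x}_n^{(T)})\right]^{1-t_n} \mathcal{N}_{D^{(C)}}\!\left(\mathbf{x}_n^{(C)} \mid \boldsymbol{\mu}_m, \boldsymbol{\Lambda}_m^{-1}\right) \right\}^{c_{nm}} \tag{5.6} \]

where $N$ denotes the number of observations, $T$ denotes the truncation level, $D^{(C)}$
denotes the dimensionality of $X^{(C)}$, and $c_{nm}$ is the binary indicator that denotes the
context of the $n$th observation.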

The  DPGMM-RVM  model  can  be  learned  discriminatively  by  assuming  conjugate 
priors  and  using  VB  inference.  The  data-generating  process  for  the  fully-conjugate 
DPGMM-RVM  is  as  follows: 

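1.  Draw $\alpha \sim \mathrm{Gamma}(\tau_{10}, \tau_{20})$

2.  For $m = 1, 2, \ldots, T$

    (a)  Draw $v_m \mid \alpha \sim \mathrm{Beta}(1, \alpha)$

    (b)  Draw $\theta^*_m \mid G_0 \sim \mathcal{N}_{D^{(C)}}(\boldsymbol{\mu}^*_m \mid \boldsymbol{\mu}_0, u_0^{-1} \boldsymbol{\Lambda}_m^{-1}) \, \mathcal{W}(\boldsymbol{\Lambda}_m \mid \mathbf{B}_0, \nu_0)$

    (c)  Calculate mixture proportions $\pi_m(\mathbf{v})$

    (d)  Draw $\beta_{md} \sim \mathrm{Gamma}(a_0, b_0)$, $d = 1, 2, \ldots, D^{(T)}$

    (e)  Draw $\mathbf{w}_m \sim \mathcal{N}_{D^{(T)}}(\mathbf{0}, \mathrm{diag}(\boldsymbol{\beta}_m)^{-1})$

3.  For $n = 1, 2, \ldots, N$

    (a)  Draw indicator variable $\mathbf{c}_n \sim \mathrm{Mult}(\boldsymbol{\pi})$

    (b)  Draw data $\mathbf{x}_n^{(C)} \mid c_{nm} = 1 \sim \mathcal{N}_{D^{(C)}}(\mathbf{x}_n^{(C)} \mid \theta^*_m)$

    (c)  Draw label $t_n \mid c_{nm} = 1 \sim \sigma(\mathbf{w}_m^T \mathbf{x}_n^{(T)})^{t_n} \left[1 - \sigma(\mathbf{w}_m^T \mathbf{x}_n^{(T)})\right]^{1-t_n}$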
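
The stick-breaking portion of the process above (steps 2(a), 2(c), and 3(a)) can be sketched in a few lines. This is an illustrative sketch only: the concentration parameter is fixed at 1.0 rather than drawn from its Gamma prior, the last stick fraction is set to one so the truncated weights sum to one, and all function names are hypothetical.

```python
import random

def stick_breaking_weights(alpha, T, rng):
    """Truncated stick-breaking: pi_m = v_m * prod_{l<m} (1 - v_l)."""
    weights, remaining = [], 1.0
    for m in range(T):
        # Fixing the final fraction at 1 assigns all leftover mass to component T.
        v = 1.0 if m == T - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

def draw_indicators(weights, n_samples, rng):
    """Draw one context index per observation from the mixture proportions."""
    return rng.choices(range(len(weights)), weights=weights, k=n_samples)

rng = random.Random(0)
pi = stick_breaking_weights(alpha=1.0, T=20, rng=rng)
contexts = draw_indicators(pi, n_samples=500, rng=rng)
```

With a small concentration parameter, most of the mass falls on the first few components, which is why the number of occupied contexts is typically well below the truncation level.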

Inference on the DPGMM-RVM seeks to cluster the $D^{(C)}$-dimensional
contextual features $X^{(C)}$ while training sparse linear classifiers in the $D^{(T)}$-dimensional
target features $X^{(T)}$. For all experiments, the following prior hyperparameter settings
were used: $a_0 = b_0 = u_0 = 1$, $\tau_{10} = \tau_{20} = 0.01$, $\nu_0 = D^{(C)}$, $\mathbf{B}_0 = D^{(C)} \mathbf{I}_{D^{(C)}}$, and $\boldsymbol{\mu}_0$
was set equal to the sample mean of $X^{(C)}$. Variational inference was performed until
the NFE converged within 0.01%. All details regarding VB for the DPGMM-RVM,
including update equations and the NFE, are derived in Appendix E.

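The structure of the DPGMM-RVM allows for mean-field updates of the DPGMM
and RVM parameters to be performed independently of one another. Only in the
update for the cluster responsibilities (the variational parameters governing the posterior
on $C$) are both sets of parameters used:

\[
\begin{aligned}
\log \phi_{nm} &\propto \log q(c_{nm} = 1) \\
&\propto \left\langle \log p(\mathbf{t}, X^{(C)} \mid C, X^{(T)}, \cdot) \right\rangle + \log p(C) \\
&\propto \left\langle \log p(\mathbf{t} \mid C, X^{(T)}) \right\rangle + \left\langle \log p(X^{(C)} \mid C) \right\rangle + \log p(C) \\
&\propto \log \sigma(\xi_{nm}) + \tfrac{1}{2}\left([2t_n - 1] \langle \mathbf{w}_m \rangle^T \mathbf{x}_n^{(T)} - \xi_{nm}\right) - \lambda(\xi_{nm}) \left(\mathbf{x}_n^{(T)T} \langle \mathbf{w}_m \mathbf{w}_m^T \rangle \mathbf{x}_n^{(T)} - \xi_{nm}^2\right) \\
&\quad + \tfrac{1}{2} \langle \log |\boldsymbol{\Lambda}_m| \rangle - \tfrac{1}{2} \left\langle (\mathbf{x}_n^{(C)} - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n^{(C)} - \boldsymbol{\mu}_m) \right\rangle \\
&\quad + \langle \log v_m \rangle + \sum_{l<m} \langle \log(1 - v_l) \rangle,
\end{aligned}
\tag{5.7}
\]

where $\lambda$ and $\xi$ are defined in the RVM derivation found in Appendix B, and $\langle \cdot \rangle$
denotes variational expectation.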
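
The way the three contributions in (5.7) combine can be sketched as follows. The sketch assumes that the expected RVM and GMM log-likelihood terms have already been computed for one observation and are passed in as lists with one entry per component; only the stick-breaking term and the normalization are implemented. The function names are hypothetical, and the log-sum-exp normalization is a standard numerical device rather than something stated in the text.

```python
import math

def stick_breaking_term(e_log_v, e_log_1mv):
    """<log v_m> + sum_{l<m} <log(1 - v_l)> for every component m."""
    terms, running = [], 0.0
    for log_v, log_1mv in zip(e_log_v, e_log_1mv):
        terms.append(log_v + running)
        running += log_1mv  # accumulates the sum over l < m for the next m
    return terms

def responsibilities(rvm_terms, gmm_terms, stick_terms):
    """Normalize the summed log terms into responsibilities that sum to one."""
    log_phi = [r + g + s for r, g, s in zip(rvm_terms, gmm_terms, stick_terms)]
    peak = max(log_phi)  # subtract the maximum before exponentiating
    unnormalized = [math.exp(value - peak) for value in log_phi]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Since the classifier term and the clustering term enter the same sum, a component that explains the context features well but predicts the label poorly receives a lower responsibility than it would under a purely generative update.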

The first line of the final expression in (5.7) is the expectation of the RVM
log-likelihood given by (B.38), the second line is the expectation of the GMM log-likelihood
given by (C.8), and the third line is the stick-breaking prior. The prior
will regularize the updates for both the DPGMM and RVM parameters, and the
DPGMM and RVM will also regularize one another. Therefore, instead of learning a
DPGMM that fits $X^{(C)}$ well, or a set of RVMs that predict $\mathbf{t}$ well, the DPGMM-RVM
will seek a model that satisfies both criteria.

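An alternative approach would be to perform clustering and classification in the
combined feature space $X = [X^{(C)}, X^{(T)}]$, which has dimensionality $D = D^{(C)} + D^{(T)}$.
The likelihood function of this model is similar to the DPGMM-RVM:

\[ p(\mathbf{t}, X \mid C, W, \boldsymbol{\mu}, \boldsymbol{\Lambda}) = \prod_{n=1}^{N} \prod_{m=1}^{M} \left\{ \sigma(\mathbf{w}_m^T \mathbf{x}_n)^{t_n} \left[1 - \sigma(\mathbf{w}_m^T \mathbf{x}_n)\right]^{1-t_n} \mathcal{N}_{D}\!\left(\mathbf{x}_n \mid \boldsymbol{\mu}_m, \boldsymbol{\Lambda}_m^{-1}\right) \right\}^{c_{nm}} \tag{5.8} \]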

The model given by (5.8) was originally presented in [109] as the quadratically-gated
mixture of experts (QGME) for classification in problems with missing data.
Incorporating a stick-breaking prior on the latent variables $C$ yields the infinite
quadratically-gated mixture of experts (IQGME) that was proposed in [94]. The
QGME and IQGME were both originally proposed for classifying data with missing
dimensions, so context-dependent learning is a novel application for this type of
model.

It was suggested in [94] that it may not be desirable to enforce sparseness in the
component classifiers if they are all jointly operating in the same feature space, since
sparse component classifiers will yield a decision function that is discontinuous in
the joint features $X$. Unlike the DPGMM-RVM, the QGME and IQGME therefore
utilize a common Normal-Gamma prior on the classifier weights given by

\[ \mathbf{w}_m \sim \mathcal{N}_D\!\left(\boldsymbol{\xi}, \mathrm{diag}(\boldsymbol{\beta})^{-1}\right), \tag{5.9} \]

\[ (\boldsymbol{\xi} \mid \boldsymbol{\beta}) \sim \mathcal{N}_D\!\left(\mathbf{0}, \gamma_0^{-1} \mathrm{diag}(\boldsymbol{\beta})^{-1}\right), \tag{5.10} \]

\[ \beta_p \sim \mathrm{Gamma}(a_0, b_0), \quad p = 1, 2, \ldots, D. \tag{5.11} \]

The data-generating process for the IQGME is very similar to the DPGMM-RVM,
with the only differences being that clustering and classification are performed on the
common features $X$ and the prior given by (5.9)-(5.11) is imposed on the classifier
weights. The hyperparameter settings for the IQGME in all experiments were very
similar to the DPGMM-RVM: $a_0 = b_0 = u_0 = 1$, $\tau_{10} = \tau_{20} = 0.01$, $\gamma_0 = 1$, $\nu_0 = D$,
$\mathbf{B}_0 = D \mathbf{I}_D$, and $\boldsymbol{\mu}_0$ was set equal to the sample mean of $X$. VB inference was also
performed until the NFE converged within 0.01%.

Although the differences between the DPGMM-RVM and IQGME may appear to
be subtle, the novel accommodation of sparse linear models through the DPGMM-RVM
allows for markedly different performance. These differences will be analyzed
in the following section through a series of synthetic data examples.

5.4  Synthetic Data Examples


In this section, the DPGMM-RVM and the IQGME are compared in three context-dependent
learning problems using synthetic data. The first problem considers the
case in which all features are informative; i.e. the classes are separable in the joint
context and target features. The second problem is similar to the first, but the context
features are made less informative by increasing the variance of each cluster.
The third problem considers the case in which most of the target features are irrelevant
in each context. This may occur in GPR algorithm fusion if one or more
algorithms perform poorly in certain environments. In all examples, the DPGMM-RVM
and IQGME were initialized with a clustering truncation of $T = 20$. The
DPGMM-RVM and IQGME are compared based upon their context identification
performance, learned discriminant weights, and overall classification accuracy.

Case  1:  All  Features  Informative 

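Figure 5.1 provides scatterplots of the synthetic target and contextual features. The
target features were drawn from Gaussian distributions conditioned on each class
and context:

\[
\begin{aligned}
p(\mathbf{x}^{(T)} \mid H_0, c_1) &= \mathcal{N}([-3, -2], 2\mathbf{I}), & p(\mathbf{x}^{(T)} \mid H_1, c_1) &= \mathcal{N}([0, 0], 2\mathbf{I}) \\
p(\mathbf{x}^{(T)} \mid H_0, c_2) &= \mathcal{N}([-4, -1], 2\mathbf{I}), & p(\mathbf{x}^{(T)} \mid H_1, c_2) &= \mathcal{N}([-2, 0], 2\mathbf{I}) \\
p(\mathbf{x}^{(T)} \mid H_0, c_3) &= \mathcal{N}([0, 0], 2\mathbf{I}), & p(\mathbf{x}^{(T)} \mid H_1, c_3) &= \mathcal{N}([-3, -2], 2\mathbf{I}) \\
p(\mathbf{x}^{(T)} \mid H_0, c_4) &= \mathcal{N}([-2, 0], 2\mathbf{I}), & p(\mathbf{x}^{(T)} \mid H_1, c_4) &= \mathcal{N}([-4, -1], 2\mathbf{I})
\end{aligned}
\]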

In the aggregate target feature space, the classes appear to overlap completely, as
shown in the left panel. The context features were drawn from four distinct Gaussian
distributions:

\[
\begin{aligned}
p(\mathbf{x}^{(C)} \mid c_1) &= \mathcal{N}([-2, 2], 2\mathbf{I}) \\
p(\mathbf{x}^{(C)} \mid c_2) &= \mathcal{N}([2, 2], 2\mathbf{I}) \\
p(\mathbf{x}^{(C)} \mid c_3) &= \mathcal{N}([-2, 2], 0.5\mathbf{I}) \\
p(\mathbf{x}^{(C)} \mid c_4) &= \mathcal{N}([2, -2], 0.5\mathbf{I})
\end{aligned}
\]

Figure 5.1: Scatterplot of target and context features for the first synthetic data example
to illustrate discriminative context-dependent learning. Left: two-dimensional
aggregate target feature space, with points colored by class; Center: two-dimensional
context feature space, with points colored by context; Right: target features, split
into individual contexts.

The center panel illustrates the two-dimensional context feature space and the distinct
clusters are clearly visible. Conditioning the target features on the true underlying
contexts reveals four classification problems that are almost linearly separable,
as shown in the rightmost panels.
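
A data set of this form can be drawn with a short script. The sketch below copies the means and variances from the distributions listed above as printed; the sample size, the equal prior over contexts and classes, and all names are assumptions, since the text does not state them.

```python
import math
import random

# context -> (target mean under H0, target mean under H1); target covariance is 2I
TARGET_MEANS = {
    1: ((-3, -2), (0, 0)),
    2: ((-4, -1), (-2, 0)),
    3: ((0, 0), (-3, -2)),
    4: ((-2, 0), (-4, -1)),
}
# context -> (context-feature mean, isotropic variance), as printed in the text
CONTEXT_GAUSSIANS = {
    1: ((-2, 2), 2.0),
    2: ((2, 2), 2.0),
    3: ((-2, 2), 0.5),
    4: ((2, -2), 0.5),
}
TARGET_VARIANCE = 2.0

def draw_isotropic(mean, variance, rng):
    """Draw from a Gaussian with covariance variance * I."""
    std = math.sqrt(variance)
    return [rng.gauss(mu, std) for mu in mean]

def draw_case1(n_samples, rng):
    """Return (context features, target features, label, context) tuples."""
    data = []
    for _ in range(n_samples):
        context = rng.choice(sorted(CONTEXT_GAUSSIANS))  # assumed uniform prior
        label = rng.randint(0, 1)                        # assumed balanced classes
        c_mean, c_var = CONTEXT_GAUSSIANS[context]
        x_context = draw_isotropic(c_mean, c_var, rng)
        x_target = draw_isotropic(TARGET_MEANS[context][label], TARGET_VARIANCE, rng)
        data.append((x_context, x_target, label, context))
    return data

samples = draw_case1(400, random.Random(0))
```

Note that the class means in Contexts 3 and 4 are those of Contexts 1 and 2 with the class labels swapped, which is why the classes overlap in the aggregate target feature space.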

Figure  5.2  illustrates  the  clustering  results  obtained  from  the  DPGMM-RVM  in 
the  contextual  feature  space.  The  left  panel  shows  a  scatterplot  of  the  contextual 
feature  space,  with  points  colored  by  the  MAP  context  assigned  by  the  DPGMM- 
RVM.  The  similarity  matrix  between  the  learned  contexts  and  the  true  context  labels 
is  shown  in  the  right  panel,  illustrating  that  the  four  contexts  that  were  learned 
correspond  very  closely  to  the  true  contexts. 
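
The text does not spell out how the similarity matrix is computed. One plausible reading, sketched below, is a row-normalized contingency table in which entry (i, j) is the fraction of observations from true context i that were assigned to learned context j. The function name and the normalization are assumptions.

```python
from collections import Counter

def similarity_matrix(true_contexts, learned_contexts):
    """Row-normalized co-occurrence of true and learned context labels."""
    counts = Counter(zip(true_contexts, learned_contexts))
    rows = sorted(set(true_contexts))
    cols = sorted(set(learned_contexts))
    matrix = []
    for r in rows:
        row_total = sum(counts[(r, c)] for c in cols)
        # Counter returns 0 for label pairs that never co-occur
        matrix.append([counts[(r, c)] / row_total for c in cols])
    return rows, cols, matrix
```

Under this reading, a learned context model that matches the truth yields one dominant entry per row, while a model that splits a true context across several learned contexts spreads the mass of that row over several columns.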

Figure 5.2: Results of context identification using the discriminative DPGMM-RVM
model for the first synthetic data example. Left: scatterplot of context features,
with points colored by MAP context; Right: similarity matrix of true and
learned context assignments.

The clustering results obtained from the IQGME are summarized in Figure 5.3. A
total of 8 clusters were learned, and they appear to overlap in the contextual feature
space. However, recall that the IQGME performs clustering on the combined contextual
and target features. Although the cluster assignments may appear to overlap
heavily in the context feature space, they are distinct in the combined features.

The differences in clustering results for the DPGMM-RVM and IQGME are
better explained by comparing the classifiers learned by each. Figure 5.4 illustrates
the  classifiers  corresponding  to  each  of  the  contexts  learned  by  the  DPGMM-RVM. 
Each  panel  shows  a  local  target  feature  space  in  which  points  are  colored  by  class, 
and  the  linear  decision  models  corresponding  to  each  context  are  also  shown.  In 
the  case  of  the  DPGMM-RVM,  each  context  is  representative  of  a  unique  binary 
classification  problem  with  approximately  equal  numbers  of  points  from  each  class. 

Figure 5.3: Results of context identification using the IQGME model for the first
synthetic data example. Left: scatterplot of context features, with points colored
by MAP context; Right: similarity matrix of true and learned context assignments.

The classifiers learned by the IQGME are shown in Figure 5.5, and are markedly
different from those learned by the DPGMM-RVM. Note that although the IQGME performs
classification in the joint context and target features, the illustrated classification
lines correspond only to the weights on the target features. Although Contexts
1, 3, 5, and 8 illustrate linearly-separable binary classification problems, Contexts
2, 4, 5, and 7 consist mostly of points from the $H_0$ class. The classifiers learned for
these contexts could be highly over-trained because they do not incorporate much
information about the $H_1$ class.

The  differences  between  the  behavior  of  the  DPGMM-RVM  and  IQGME  can  be 
further  highlighted  through  analysis  of  the  discriminant  weights,  which  are  plotted 
in  Figure  5.6.  The  top  panel  illustrates  the  weights  learned  by  the  DPGMM-RVM 
for  each  context,  and  the  center  panel  illustrates  the  weights  learned  by  the  IQGME. 
The  bottom  plot  shows  the  weights  obtained  from  an  “oracle”  that  trains  a  linear 
RVM  on  each  of  the  context-specific  classification  problems. 

The weights learned by the DPGMM-RVM agree nearly perfectly with the oracle
weights. Because the IQGME operates on the joint target and context features, it
assigns weights to the two context features as well. The weights assigned to the target
features are of smaller magnitude than the weights assigned by the DPGMM-RVM
and the oracle, and are of similar magnitude to the weights assigned to the context
features. This result suggests that the IQGME also found the context features to be
informative of class.

Figure 5.4: Component classifiers learned by the discriminative DPGMM-RVM
model for the first synthetic data example. Each panel illustrates a two-dimensional
scatterplot of the target features, corresponding to points from each learned context,
with the decision boundary learned for each context overlaid.

Figure 5.5: Component classifiers learned by the IQGME model for the first synthetic
data example. Each panel illustrates a two-dimensional scatterplot of the
target features, corresponding to points from each learned context, with the decision
boundary learned for each context overlaid.

ROC curves for the DPGMM-RVM, IQGME, and the oracle are plotted in Figure 5.7.
Results were evaluated by training and testing on different sets of data drawn
from the same context and target feature distributions. The ROC for the DPGMM-RVM
is shown in blue, and that for the IQGME is shown in green. Performance is compared
to generative context-dependent learning with the DPGMM-RVM (red), the oracle
(black solid), and a linear RVM operating on the target features alone (black dashed).
The performance of the discriminative and generative DPGMM-RVM was similar,
with the generative approach performing slightly better. The IQGME
did not perform as well as either DPGMM-RVM; it is likely that the IQGME was
overtrained since it learned classifiers for contexts consisting only of data from one
class.
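
For readers who wish to reproduce this style of comparison, an empirical ROC curve can be computed from classifier scores and binary labels as sketched below. This is a generic threshold sweep with a trapezoidal area estimate, not the evaluation code used for the figures; the function names are hypothetical, and both classes are assumed to be present in the labels.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores; return (false alarm rate, detection rate) pairs."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        is_last = rank == len(order) - 1
        # Emit a point only when the next score differs, so ties share one threshold.
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a list of ROC points ordered by false alarm rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Training and testing on separate draws, as described above, matters here: scores computed on the training data would make an overtrained model look better than it is.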

Case  2:  Less-Informative  Context  Features 

In the second simulated data example, the context features were less informative since
the clusters overlapped more in the feature space. This was achieved by increasing
the variances of each dimension in each context. The distributions for Contexts 1
and 2 had a covariance of $3\mathbf{I}$ and Contexts 3 and 4 had a covariance of $2\mathbf{I}$. Figure 5.8
illustrates scatterplots of the synthetic target and contextual features for the second
synthetic data example.

Figure 5.6: Discriminant weights learned by the DPGMM-RVM, IQGME, and
the context oracle for the first synthetic data example, colored by context. Top:
DPGMM-RVM weights; Center: IQGME weights; Bottom: RVM weights based on
the context oracle.

Figure 5.9 illustrates the clustering results obtained from the DPGMM-RVM in
the contextual feature space, as well as the similarity matrix between the learned
contexts and the true context labels. In this case, the DPGMM-RVM learned more
contexts than before, yielding a total of 7. Most of the data from each of the four
true contexts are split between three or four learned contexts. This illustrates that
when less obvious clustering exists in the contextual features, the number of contexts
learned by the DPGMM-RVM may increase.

Figure 5.7: ROC curves comparing discriminative context-dependent learning on
the first synthetic data example. Performance is compared between the DPGMM-RVM
(blue), IQGME (green), generative context-dependent learning with the
DPGMM-RVM (red), linear RVM learned on target features only (black dashed),
and the context oracle (black solid).

Figure 5.8: Scatterplot of target and context features for the second synthetic
data example. Left: two-dimensional aggregate target feature space, with points
colored by class; Center: two-dimensional context feature space, with points colored
by context; Right: target features, split into individual contexts. [21]

Figure 5.9: Results of context identification using the discriminative DPGMM-RVM
model for the second synthetic data example. Left: scatterplot of context
features, with points colored by MAP context; Right: similarity matrix of true and
learned context assignments. [21]

The  clustering  results  obtained  from  the  IQGME  are  summarized  by  Figure  5.10. 
A  total  of  11  clusters  were  learned,  and  like  before,  they  appear  to  overlap  in  the 
contextual  feature  space  since  clustering  was  performed  on  the  combined  contextual 
and  target  features.  Furthermore,  the  higher  number  of  clusters  suggests  that  more 
locally-unique  classification  problems  were  learned  from  the  IQGME  than  from  the 
DPGMM-RVM. 

The component classifiers learned by the discriminative DPGMM-RVM are shown
in Figure 5.11. Although more contexts were learned in the case of less-informative
context features, the DPGMM-RVM still finds linearly-separable sub-problems for
each context. It is interesting to note that for most of the contexts, the learned
decision boundary is either purely horizontal or vertical. This suggests that in these
contexts, only one target feature is relevant.

Figure 5.10: Results of context identification using the IQGME model for the
second synthetic data example. Left: scatterplot of context features, with points
colored by MAP context; Right: similarity matrix of true and learned context assignments.


The classifiers learned by the IQGME are shown in Figure 5.12. Recall that the
IQGME performs classification in the joint context and target feature space; for
visualization purposes, the classification lines shown in each panel are determined
by the weights on the target features. As in the first example, most of the contexts
learned by the IQGME consist of data from mostly one class. This is true for
Contexts 1, 4, 5, 9, and 10. Based on these results, the IQGME would be expected
to perform similarly to the first example.

The discriminant weights for the DPGMM-RVM, IQGME, and oracle are shown
in Figure 5.13. Similar to the previous case, the DPGMM-RVM and oracle have
weights of similar magnitude. However, since more than four contexts were learned,
they do not match nearly as well as in the first example. The IQGME
behaves as it did before, assigning small weight to each of the target
and context features.

Figure 5.11: Component classifiers learned by the discriminative DPGMM-RVM
model for the second synthetic data example. Each panel illustrates a two-dimensional
scatterplot of the target features, corresponding to points from each
learned context, with the decision boundary learned for each context overlaid. [21]


ROC curves for the second synthetic data example are shown in Figure 5.14.
In this case, both discriminative approaches outperformed the generative approach.
This is because the discriminative models learned contexts where classification could
be performed effectively, while the generative model only sought to cluster the context
features. Another interesting observation is that both discriminative models
performed similarly to one another, suggesting that the IQGME was not as overtrained
as the larger number of contexts may have suggested.

Case  3:  Irrelevant  Target  Features 

The third simulated data example addresses performance when some target features
are irrelevant. This has implications for buried threat detection, in which the relevance
of detection algorithms may vary with respect to environment. For this case
of simulated data, the target features were 10-dimensional, only two of which were
relevant in each context. The two relevant features were drawn from the same distributions
as in the previous example, and the irrelevant features were drawn from
a Gaussian distribution with zero mean and variance of 2. The first two target features
were relevant in Context 1, the last two were relevant in Context 2, features
1 and 10 were relevant in Context 3, and features 5 and 6 were relevant in Context
4. The contextual features were drawn from the same two-dimensional Gaussian
distributions used in the second example.

Figure 5.12: Component classifiers learned by the IQGME model for the second
synthetic data example. Each panel illustrates a two-dimensional scatterplot of the
target features, corresponding to points from each learned context, with the decision
boundary learned for each context overlaid.

Figure 5.13: Discriminant weights learned by the DPGMM-RVM, IQGME, and
the context oracle for the second synthetic data example, colored by context. Top:
DPGMM-RVM weights; Center: IQGME weights; Bottom: RVM weights based on
the context oracle. [21]
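
The layout of the 10-dimensional target features in the third example can be sketched as follows. The relevant feature positions follow the description in the text (converted to zero-based indices); the function name, its arguments, and the way the two relevant values are supplied are illustrative assumptions.

```python
import math
import random

# context -> zero-based positions of the two relevant target features
RELEVANT = {1: (0, 1), 2: (8, 9), 3: (0, 9), 4: (4, 5)}

def embed_target_features(relevant_pair, context, rng, dim=10, noise_variance=2.0):
    """Place two informative values among irrelevant zero-mean Gaussian features."""
    std = math.sqrt(noise_variance)
    x = [rng.gauss(0.0, std) for _ in range(dim)]  # irrelevant features
    for index, value in zip(RELEVANT[context], relevant_pair):
        x[index] = value  # overwrite the positions that matter in this context
    return x

example = embed_target_features([1.5, -2.5], context=3, rng=random.Random(0))
```

Because the informative positions change with context, a single sparse classifier cannot discard the irrelevant features globally; sparsity has to be learned separately for each context.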

Figure  5.15  illustrates  the  clustering  results  obtained  from  DPGMM-RVM  in 
the  contextual  feature  space.  The  DPGMM-RVM  performed  similarly  compared  to 
the  previous  example,  learning  six  contexts.  Figure  5.16  illustrates  the  clustering 
results  obtained  from  the  IQGME.  Compared  to  the  previous  example,  the  IQGME 
learned  more  contexts.  A  total  of  17  contexts  were  learned,  and  like  the  previous 
examples,  they  overlapped  heavily  in  the  contextual  feature  space  since  clustering 
was  performed  on  the  joint  context  and  target  features. 

More differences between the performance of the DPGMM-RVM and IQGME in
the presence of irrelevant features can be seen by analyzing the learned discriminant
weights, which are plotted in Figure 5.17. The weights for the DPGMM-RVM are
very similar to those learned by the oracle, illustrating that most of the weights
for each context are zero, and the relevant features in each context receive nonzero
weight. Meanwhile, the IQGME classifiers are not sparse, and most of the target
and context features receive a relatively small weight.

Figure 5.14: ROC curves comparing discriminative context-dependent learning
on the second synthetic data example. Performance is compared between the
DPGMM-RVM (blue), IQGME (green), generative context-dependent learning with
the DPGMM-RVM (red), linear RVM learned on target features only (black dashed),
and the context oracle (black solid). [21]

The  ROC  curves  comparing  the  performance  of  the  DPGMM-RVM  and  IQGME 
in  the  presence  of  irrelevant  features  are  provided  in  Figure  5.18.  In  this  case,  the 
DPGMM-RVM appears to be more robust than the IQGME since it was able to correctly
model the relevance of the target features with respect to context. These results
suggest that the DPGMM-RVM may be a superior model for context-dependent
learning if different target features are expected to be irrelevant under certain
environmental conditions.

Figure 5.15: Results of context identification using the discriminative DPGMM-RVM
model for the third synthetic data example. Left: scatterplot of context features,
with points colored by MAP context; Right: similarity matrix of true and
learned context assignments.

Figure 5.16: Results of context identification using the IQGME model for the third
synthetic  data  example.  Left:  scatterplot  of  context  features,  with  points  colored  by 
MAP  context;  Right:  similarity  matrix  of  true  and  learned  context  assignments. 

Figure 5.17: Discriminant weights learned by the DPGMM-RVM, IQGME, and
the  context  oracle  for  the  third  synthetic  data  example,  colored  by  context.  Top: 
DPGMM-RVM  weights;  Center:  IQGME  weights;  Bottom:  RVM  weights  based  on 
the  context  oracle. 


In  summary,  the  DPGMM-RVM  and  the  IQGME  are  two  similar  approaches  to 
discriminative  context  learning.  However,  their  behavior  on  synthetic  data  highlights 
important  differences  as  to  when  each  is  appropriate  to  use.  The  IQGME  performs 
clustering  and  classification  in  a  common  feature  space,  and  therefore  is  not  amenable 
to  sparse  classifiers.  In  contrast,  the  DPGMM-RVM  performs  clustering  on  the 
contextual  features,  while  also  performing  classification  in  the  features  designed  for 
discriminating  targets. 

The  synthetic  data  examples  showed  that  in  the  case  where  all  features  are  equally 
informative, and the context features form distinct clusters, generative context learning
may be the best approach. However, if the contextual features do not cluster well,
discriminative  context  learning  can  improve  overall  classification  performance.  The 
final  example  considered  the  case  in  which  some  target  features  were  non-informative, 

[Plot title: ROC - Discriminative Context-Dependent Learning with Irrelevant Target Features]


Figure 5.18: ROC curves comparing discriminative context-dependent learning on
the  third  synthetic  data  example.  Performance  is  compared  between  the  DPGMM- 
RVM  (blue),  IQGME  (green),  linear  RVM  learned  on  target  features  only  (red  solid), 
linear  RVM  learned  on  both  sets  of  features  together  (red  dashed),  and  the  context 
oracle  (black). 

illustrating that the DPGMM-RVM can effectively learn the context-dependent relevance
of target features.

5.5  Experimental  Results  with  GPR  Data 

The  discriminative  DPGMM-RVM  and  IQGME  were  used  for  context-dependent 
algorithm  fusion  and  evaluated  on  the  GPR  data  set  used  in  Chapters  3  and  4.  The 
target  features  consisted  of  the  prescreener,  EHD,  HMM,  and  SPSCF  confidence 
values.  As  was  done  for  the  generative  DPGMM  in  Chapter  4,  the  context  features 
originally  proposed  in  Chapter  2  were  projected  to  3-D  via  PCA.  The  DPGMM- 
RVM  and  IQGME  discriminative  models  were  trained  using  variational  inference 
with  the  same  hyperparameter  settings  from  the  synthetic  examples.  In  addition, 
the  truncation  level  for  initializing  the  DPGMM-RVM  was  set  to  T  =  30,  and  the 
truncation level for IQGME was set to T = 20. Due to the computational expense
of training these models, both were trained on a subset consisting of 6,864 alarms
that  included  all  target  alarms  and  a  3:1  clutter-to-target  ratio. 
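
The subset construction described above can be sketched in a few lines. This is only an illustration: the function name `subsample_alarms` and the choice of uniform random sampling for the clutter alarms are assumptions, since the text does not state how the clutter alarms were selected. If the 3:1 ratio is exact, a 6,864-alarm subset implies 1,716 target alarms and 5,148 clutter alarms.

```python
import random

def subsample_alarms(labels, clutter_per_target=3, seed=0):
    """Return sorted indices of all targets plus a random clutter subset.

    labels: list of 0/1 values (1 = target alarm, 0 = clutter alarm).
    Uniform random selection of clutter is an assumption of this sketch.
    """
    targets = [i for i, y in enumerate(labels) if y == 1]
    clutter = [i for i, y in enumerate(labels) if y == 0]
    n_clutter = min(len(clutter), clutter_per_target * len(targets))
    rng = random.Random(seed)
    return sorted(targets + rng.sample(clutter, n_clutter))

# Arithmetic implied by an exact 3:1 ratio on a 6,864-alarm subset.
n_targets = 6864 // (1 + 3)
n_clutter = 3 * n_targets

# Toy usage with hypothetical labels: 10 targets and 100 clutter alarms.
toy_labels = [1] * 10 + [0] * 100
toy_subset = subsample_alarms(toy_labels)
```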

5.5.1  Context  Identification  Performance 

The  results  of  context  identification  using  the  DPGMM-RVM  are  summarized  by 
Figure  5.19,  which  illustrates  a  scatterplot  of  the  contextual  features  colored  by 
soil label and by MAP contexts learned from the DPGMM-RVM. Additionally, Figure 5.20
shows the similarity matrix between the soil labels and DPGMM-RVM
contexts. Results illustrate that the DPGMM-RVM learned a total of 21 contexts.
This result appears very similar to what was obtained from the generative DPGMM
in Chapter 4, which learned 19 contexts as shown in Figures 4.5 and 4.6. Similarities between
generative and discriminative context learning include that the largest contexts contain
mostly dirt observations, and that contexts composed of mostly asphalt and
concrete  data  are  distinct  from  those  composed  of  mostly  dirt  and  gravel.  Another 
similarity  is  that  gravel  data  held  a  majority  in  only  a  few  contexts  (Contexts  4  and 
13),  while  holding  a  large  minority  of  the  population  of  many  other  contexts. 
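
Statements about which soil type holds a majority or a large minority in a context can be read off a contingency table of learned contexts against soil labels. The sketch below shows one way to build such a table. The toy data are hypothetical, and using raw counts (rather than whatever normalization the similarity matrices in the figures use) is an assumption.

```python
from collections import Counter

def contingency(true_labels, learned_contexts):
    """Count co-occurrences of (learned context, true label) pairs."""
    counts = Counter(zip(learned_contexts, true_labels))
    contexts = sorted(set(learned_contexts))
    labels = sorted(set(true_labels))
    table = [[counts[(c, l)] for l in labels] for c in contexts]
    return contexts, labels, table

def majority_label(contexts, labels, table):
    """Map each context to the label with the largest count in its row."""
    return {c: labels[row.index(max(row))] for c, row in zip(contexts, table)}

# Toy data (hypothetical, not drawn from the GPR collection).
soil = ["dirt", "dirt", "gravel", "gravel", "asphalt", "gravel", "asphalt"]
ctx = [1, 1, 1, 2, 3, 2, 3]
contexts, labels, table = contingency(soil, ctx)
majority = majority_label(contexts, labels, table)
```

In this toy case gravel holds the majority in context 2 but only a minority in context 1, which mirrors the kind of pattern described for the gravel data.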

The  IQGME  behaved  differently  on  the  GPR  features  than  it  did  in  the  synthetic 
data example. The IQGME identified fewer contexts than the DPGMM-RVM, yielding
13 contexts total. The scatterplots comparing the learned IQGME contexts to
the known soil labels are shown in Figure 5.21, and the similarity matrix is shown
in Figure 5.22. The scatterplot shows significant overlap of the context assignments,
as it did in the synthetic examples. The vast majority of the data fall under Contexts
1 and 2, suggesting that the “typical” classification problem lies in these large
contexts. 

The  similarity  matrix  shown  in  Figure  5.23  compares  the  context  identification 
performance of both the DPGMM-RVM and the IQGME. Because both techniques
identified  a  large  number  of  contexts,  it  was  difficult  to  visually  compare  the  context 


[Figure 5.19 panel titles: Known Soil Labels (left); Discrim. DPGMM-RVM Context Learning (right). Legend: Contexts 1-21.]

Figure 5.19: Scatterplot comparing results of context learning using the discriminative
DPGMM-RVM on the GPR contextual features to the known soil labels. Left:
Scatter  plot  of  3-D  PCA  projection  of  contextual  features,  with  points  colored  by 
qualitative  soil  label.  Right:  Same  scatter  plot,  but  with  points  colored  by  MAP 
mixture  component.  [21] 

Figure 5.20: Similarity matrix comparing DPGMM-RVM clustering results to the
known  soil  labels. 

Figure 5.21: Scatterplot comparing results of context learning using IQGME on
the GPR contextual features to the known soil labels. Left: Scatter plot of 3-D PCA
projection  of  contextual  features,  with  points  colored  by  qualitative  soil  label.  Right: 
Same  scatter  plot,  but  with  points  colored  by  MAP  mixture  component. 

[Plot title: Similarity Matrix - DPGMM-RVM vs. IQGME Context Learning; Adjusted Mutual Information (AMI) = 0.30022]

Figure 5.23: Similarity matrix comparing IQGME context identification to
DPGMM-RVM  context  identification.  The  horizontal  axis  represents  the  DPGMM- 
RVM  contexts,  and  the  vertical  axis  represents  the  IQGME  contexts.  The  AMI  of 
the  two  clusterings  [110]  is  shown  at  top.  [21] 

assignments  from  the  scatterplots  in  Figures  5.19  and  5.21.  Instead,  the  adjusted 
mutual information (AMI) [110] was used to compare the results of context identification.
The AMI can be used to compare two clusterings, each having different
numbers of clusters, while correcting for the effect of chance agreement. The AMI
is at most one: an AMI of one would be obtained for two identical
clusterings, and an AMI near zero would be expected for two clusterings with only
chance similarity. The AMI between the contexts identified by the discriminative
DPGMM-RVM and IQGME was 0.3002. Although there appears to be strong overlap
between IQGME Contexts 1 and 2 and DPGMM-RVM Contexts 8, 12, 16, and
18,  which  contain  the  majority  of  observations  and  mostly  correspond  to  the  dirt 
soil  type,  the  low  AMI  metric  suggests  that  little  information  is  shared  between  the 
clusterings.  However,  based  on  results  from  the  synthetic  data  examples,  the  low 
degree  of  similarity  between  the  DPGMM-RVM  and  IQGME  contexts  was  expected. 
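
To make the chance correction concrete, the following is a minimal pure-Python sketch of the AMI. It adjusts the mutual information by its expected value under a hypergeometric model of random assignment. Normalizing by the arithmetic mean of the two cluster entropies is an assumption of this sketch; [110] discusses several normalizations, and the one used for the reported value is not restated here. The function names are illustrative.

```python
from collections import Counter
from math import exp, lgamma, log

def entropy(counts, n):
    """Shannon entropy (in nats) of a clustering with the given cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def mutual_information(u, v):
    """Mutual information between two label lists of equal length."""
    n = len(u)
    a, b, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(nij / n * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())

def expected_mutual_information(u, v):
    """Expected MI when cluster sizes are fixed and assignments are random."""
    n = len(u)
    a, b = list(Counter(u).values()), list(Counter(v).values())
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of overlap nij.
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_p)
    return emi

def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean entropy normalization (an assumption here)."""
    n = len(u)
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    h_mean = 0.5 * (entropy(Counter(u).values(), n)
                    + entropy(Counter(v).values(), n))
    return (mi - emi) / (h_mean - emi)

same = adjusted_mutual_information([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
crossed = adjusted_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Relabeling the clusters does not change the score, so `same` is one. Because the expected mutual information is subtracted, clusterings that agree less than chance (as in `crossed`) receive a negative value. The sketch does not handle the degenerate case in which both clusterings consist of a single cluster, where the denominator is zero.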

5.5.2 Context-Dependent Fusion Results


The  discriminant  weights  learned  for  the  DPGMM-RVM  and  IQGME  are  shown 
in  Figure  5.24.  The  DPGMM-RVM  weights  are  shown  in  the  top  panel,  and  the 
IQGME  weights  are  shown  at  the  bottom.  The  first  four  dimensions  are  the  target 
features,  and  in  the  case  of  IQGME,  the  last  three  are  the  context  features.  For  the 
DPGMM-RVM,  the  weights  on  the  feature  values  (not  the  bias)  are  mostly  either 
positive  or  zero.  The  one  exception  to  this  is  the  prescreener  weight  in  Context  8. 
Therefore,  the  DPGMM-RVM  weights  could  be  interpreted  as  each  algorithm  being 
either  relied  upon  or  ignored  in  each  context,  and  only  rarely  discounted. 

However,  the  IQGME  weights  for  the  target  features  appear  somewhat  evenly 
distributed  around  zero;  some  are  positive,  and  others  are  negative.  This  suggests 
that  the  local  classification  problems  discovered  by  IQGME  are  substantially  different 
than  those  found  by  the  DPGMM-RVM,  and  the  negative  weights  will  cause  fusion  to 
discount  certain  algorithms’  confidences  for  some  contexts.  Therefore,  fusion  would 
tend  to  make  a  decision  opposite  of  what  the  negatively-weighted  algorithms  may 
indicate  in  those  contexts. 
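
The effect of zero and negative weights can be illustrated with a small sketch of context-weighted fusion. The form used here, a context-posterior-weighted sum of logistic outputs from per-context linear discriminants, is an assumption made for illustration and is not a restatement of the models in this chapter. All weights, biases, and confidence values below are hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fuse(confidences, context_posteriors, weights, biases):
    """Context-weighted fusion of algorithm confidences.

    confidences: one value per algorithm (e.g., prescreener, EHD, HMM, SPSCF).
    context_posteriors: p(context m | context features), summing to one.
    weights[m], biases[m]: linear discriminant for context m.
    """
    total = 0.0
    for p_m, w_m, b_m in zip(context_posteriors, weights, biases):
        score = b_m + sum(w * x for w, x in zip(w_m, confidences))
        total += p_m * sigmoid(score)
    return total

# Hypothetical discriminants: context 0 ignores the second algorithm
# (zero weight); context 1 discounts it (negative weight).
weights = [[1.0, 0.0, 2.0, 0.0],
           [0.5, -1.5, 1.0, 0.2]]
biases = [-1.0, -0.5]

low = [0.5, 0.1, 0.5, 0.5]   # second algorithm reports low confidence
high = [0.5, 0.9, 0.5, 0.5]  # second algorithm reports high confidence

ignored_low = fuse(low, [1.0, 0.0], weights, biases)
ignored_high = fuse(high, [1.0, 0.0], weights, biases)
discounted_low = fuse(low, [0.0, 1.0], weights, biases)
discounted_high = fuse(high, [0.0, 1.0], weights, biases)
```

With a zero weight the fused output does not change when the second algorithm's confidence changes. With a negative weight, a higher confidence from that algorithm lowers the fused output, which is the behavior described above for negatively-weighted algorithms.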

5.5.3  Detection  Performance 

The  discriminative  context-dependent  fusion  techniques  were  evaluated  using  the 
same  cross-validation  folds  that  were  used  to  compute  the  ROC  curves  presented  in 
Chapters  3  and  4.  Both  discriminative  learning  techniques,  the  DPGMM-RVM  and 
IQGME,  were  evaluated  and  compared  to  the  generative  DPGMM-RVM  presented 
in  Chapter  4  as  well  as  conventional  fusion  with  a  linear  RVM. 

Figure  5.25  illustrates  the  ROC  curves  obtained  for  each  of  the  fusion  approaches 
that  were  evaluated,  as  well  as  the  prescreener,  EHD,  SPSCF  and  HMM  algorithms. 
The  global  RVM  curve,  shown  by  the  black  solid  line,  is  plotted  along  with  a  shaded 
region  indicating  the  90%  confidence  region.  The  ROC  curve  for  the  generative 

[Panel titles: Discriminant Weights - DPGMM-RVM (top); Discriminant Weights - IQGME (bottom)]

Figure 5.24: Discriminant weights learned by the DPGMM-RVM and IQGME
for  algorithm  fusion  on  the  GPR  data  set.  Top:  DPGMM-RVM  weights;  Bottom: 
IQGME  weights. 


DPGMM-RVM, originally shown as the red line in Figure 4.15, is plotted for reference.
The discriminative DPGMM-RVM is shown by the green line, and the discriminative
IQGME by the blue line. Results show that all three context-dependent
fusion  techniques  yield  significantly  better  performance  than  the  single  RVM.  The 
discriminative  context-dependent  fusion  techniques  both  show  a  lower  FAR  than  the 
generative  technique  at  low  PD.  The  three  ROC  curves  cross  around  PD=0.65.  From 
0.65 < PD < 0.85, the generative context-dependent approach has the best performance.
At higher PD, the performance of generative context-dependent fusion and

Figure 5.25: ROC curves for discriminative context-dependent fusion using
the IQGME (blue) and DPGMM-RVM (green) compared to generative context-dependent
fusion (red), non-context-dependent RVM fusion (black dashed), and the
individual  fused  algorithms  (dotted).  The  ROC  consists  of  PD  versus  FAR,  measured 
in  false  alarms  per  square  meter,  as  a  function  of  decision  threshold.  [21] 


the  discriminative  IQGME  are  similar,  while  the  discriminative  DPGMM-RVM  is 
not  significantly  better  than  the  single  RVM. 
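
For reference, the PD-versus-FAR operating points that make up such a curve can be computed by sweeping the decision threshold over the fused confidences. The sketch below is generic: the function name, the `area_m2` parameter, and the toy values are hypothetical, ties between confidence values are not treated specially, and PD is computed over target alarms, which assumes every target produced an alarm.

```python
def roc_points(confidences, is_target, area_m2):
    """Return (FAR, PD) pairs as the decision threshold sweeps downward.

    FAR is in false alarms per square meter; area_m2 is the surveyed area.
    """
    n_targets = sum(is_target)
    ranked = sorted(zip(confidences, is_target), key=lambda t: -t[0])
    points, hits, false_alarms = [], 0, 0
    for _, target in ranked:
        if target:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / area_m2, hits / n_targets))
    return points

# Toy example with hypothetical values: four alarms over 10 square meters.
points = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], area_m2=10.0)
```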

Of  the  three  synthetic  data  examples,  the  results  presented  in  Figure  5.25  appear 
to  be  most  similar  to  the  first  case.  Although  it  was  expected  that  discriminative 
context-dependent fusion would yield the best performance, neither approach outperformed
the generative context-dependent fusion technique presented in Chapter 4.
However, it is interesting to see that both discriminative approaches performed similarly
despite incorporating contextual information in different ways. Furthermore,
the  similarity  in  performance  between  both  discriminative  approaches  suggests  that 
the  target  features  are  relevant  across  all  contexts.  Therefore,  enforcing  sparseness  in 
the  DPGMM-RVM  discriminant  weights  did  not  improve  performance.  The  superior 
performance  of  the  generative  approach  suggests  that  the  context  features  already 
cluster with respect to relevant contextual factors, and discriminative context learning
may not be necessary.

5.6  Conclusion 

In this chapter, two potential methods for discriminative context learning were presented.
The first approach, referred to as the discriminative DPGMM-RVM, was
based  upon  the  generative  techniques  presented  in  the  previous  chapter  but  was 
learned  based  on  the  joint  likelihood  of  the  contextual  features  and  class  labels.  The 
second  approach,  the  IQGME,  is  similar  in  that  the  gating  network  is  based  on  the 
DPGMM.  However,  the  local  experts  are  not  sparse,  and  classification  and  clustering 
are  performed  on  the  joint  target  and  context  features. 

Several  examples  using  synthetic  data  were  used  to  illustrate  the  differences  in 
behavior between the two discriminative context learning approaches. The first example
considered two-dimensional context and target features, in which all were
informative. Comparison of the DPGMM-RVM and IQGME showed similarities in
classification performance, although the contexts that were learned were substantially
different. The discriminant weights learned for the IQGME were smaller in
magnitude than those learned for the DPGMM-RVM, since the IQGME also incorporated
context features. Furthermore, the IQGME appeared over-trained since
it learned contexts consisting of only one class of data. Therefore, the DPGMM-RVM
led to better performance, although generative context-dependent learning
performed slightly better. The second example was similar to the first, with the only
difference being that the context features were less informative. In this case, both discriminative
context-dependent classifiers performed similarly and yielded substantial
performance  improvements  over  generative  context-dependent  learning. 

The third synthetic data example considered the case in which some target features
were irrelevant, depending on the context. The DPGMM-RVM accurately
identified  the  features  that  were  relevant  in  each  context.  The  IQGME  yielded  a 
model  that  was  much  less  sparse,  in  terms  of  the  number  of  learned  contexts,  than 
the  DPGMM-RVM.  In  this  example,  the  DPGMM-RVM  achieved  performance  gains 
over  the  IQGME  due  to  its  ability  to  learn  which  features  were  relevant  in  which 
context. 

For  experiments  with  GPR  data,  it  was  expected  that  discriminative  context- 
dependent  learning  would  yield  results  similar  to  the  second  and  third  examples. 
However,  experimental  results  appear  to  be  similar  to  the  first  synthetic  example. 
The similarity in performance of both discriminative techniques suggests that all of
the  target  features  may  be  relevant  across  contexts.  Furthermore,  the  fact  that 
generative  context-dependent  learning  yielded  better  performance  suggests  that  the 
proposed  features  are  very  informative  of  the  underlying  contextual  factors,  and  that 
incorporating  more  information  through  discriminative  context  learning  may  not  be 
necessary. 

Additional sources of contextual information should still be considered for improving
performance. One potential source is the spatial distribution of the context
features.  The  context  learning  techniques  presented  up  to  this  point  considered 
individual  prescreener  alarms  as  statistically  independent  observations.  However,  a 
wealth  of  contextual  information  may  be  available  in  the  large  stretches  of  target-free 
data  collected  between  prescreener  alarms.  By  regularly  sampling  the  background 
to extract contextual features, spatially-distributed contextual factors may be discovered.
This information can be valuable for inferring the underlying context well
before a prescreener alarm is recorded. The following chapter investigates two techniques
for achieving this goal through nonparametric spatial context modeling.

6


Nonparametric  Spatial  Context  Models 


In military route clearance applications, a vehicular GPR system such as the NIITEK
HMDS may lead a convoy over many kilometers through varying terrain while
searching  for  buried  explosive  threats.  In  Chapters  3-5,  several  context  learning 
techniques  were  proposed  for  exploiting  information  regarding  terrain  differences  to 
improve  the  detection  performance  achieved  by  algorithm  fusion.  These  techniques 
utilized  contextual  information  extracted  near  recorded  prescreener  alarms,  and  all 
alarms were treated as independent observations. In practice, it may be more advantageous
to regularly extract contextual features from the background, and utilize the
spatial  dependency  of  observations  for  better  inference  of  the  underlying  context. 

This  chapter  proposes  two  methods  for  nonparametric  spatial  context  modeling. 
While  the  previously-discussed  context  models  operated  on  an  alarm-by-alarm  basis, 
the  models  proposed  in  this  chapter  are  used  to  infer  context  as  a  function  of  space. 
This  is  achieved  by  extracting  contextual  features  at  regular  downtrack  intervals  and 
performing  inference  on  each  sample.  The  first  context  model  that  will  be  considered 
is  the  DPGMM,  which  was  originally  presented  for  generative  alarm-based  context 
learning  in  Chapter  4.  The  second  model  to  be  considered  is  the  stick-breaking 
hidden Markov model (SBHMM), which is a nonparametric extension of the HMM
originally  proposed  by  Paisley  and  Carin  [69].  Like  the  DPGMM,  the  SBHMM 
employs  a  stick-breaking  prior  to  facilitate  learning  of  the  model’s  order.  However, 
unlike  the  DPGMM  which  assumes  all  samples  are  independent,  the  SBHMM  context 
model  allows  for  spatial  dependency  between  samples. 

6.1  Spatial  Context  Sampling 

The  context  modeling  approaches  proposed  in  this  chapter  utilize  contextual  features 
extracted  from  the  background  at  regular  intervals  over  a  given  area.  The  feature 
extraction  process  is  referred  to  as  context  sampling.  There  are  several  reasons  for 
using  context  sampling  as  opposed  to  extracting  features  from  prescreener  alarms. 
The  primary  reason  is  that  in  route  clearance  patrols,  the  vast  majority  of  GPR  data 
collected  in  the  field  will  be  free  of  buried  threats.  In  current  processing  strategies, 
the large stretches of background data are generally ignored after prescreening [41].
Although  this  background  data  may  be  target-free,  it  could  potentially  be  rich  in 
contextual  information.  Another  reason  to  motivate  context  sampling  is  that  certain 
contextual  factors  may  be  spatially-distributed.  For  example,  consider  a  desert  gulch, 
a  local  region  of  low  elevation  where  moisture  may  accumulate  in  the  event  of  a  flash 
flood.  It  would  be  expected  that  the  soil  in  a  recently  washed-out  area  may  contain 
more  moisture  than  surrounding  areas  at  higher  elevations. 

Consider  the  example  shown  in  Figure  6.1.  The  top  panel  illustrates  raw  GPR 
data collected on a concrete test lane, and the anomalies occurring around time sample
200 correspond to landmine signatures. In the late-time portion of the B-scan, a
faint subsurface layer emerges around downtrack sample 1000 and becomes stronger
around downtrack sample 2800. A second subsurface layer appears around downtrack
sample 4300. These distinct regions characterized by different subsurface layer
responses  could  possibly  correspond  to  unique,  spatial  context  regions  as  illustrated 

[Plot title: GPR Data: Concrete Lane]

Figure 6.1: Example of GPR data collected on a concrete lane and apparent spatial
context  regions.  Top:  raw  GPR  data,  Bottom:  raw  GPR  data  with  apparent  spatial 
contexts  indicated  by  different  shaded  regions.  Downtrack  position  is  represented  by 
the  horizontal  axis  in  both  panels. 


by  the  bottom  panel. 

In this work, context sampling was performed by extracting the contextual features
proposed in Chapter 2 from the background at regular 10 cm intervals. The
sequence of background features is denoted by $X^{(c)} = [x_1^{(c)}, x_2^{(c)}, \ldots, x_N^{(c)}]$, where
N  is  the  length  of  the  sequence.  Although  features  were  extracted  from  this  data 
off-line,  it  is  understood  that  real-time  implementation  will  be  necessary  in  fielded 
applications.  Therefore,  the  sampling  interval  may  need  to  be  increased  to  facilitate 
real-time  processing.  Furthermore,  context  features  were  only  extracted  from  the 
center channel (channel 24) of the GPR array. Although more contextual information
could potentially be exploited by sampling the other channels, incorporating
features from the other channels did not improve performance. In a similar vein
to previously-discussed context modeling techniques, the features were projected to
3-D  via  PCA.  Since  PCA  implies  an  underlying  Gaussian  distribution,  it  facilitated 
training  spatial  context  models  based  on  Gaussian  distributions.  Furthermore,  three 
principal  components  were  used  because  the  alarm-based  DPGMM  context  model 
also  performed  best  using  the  same  number  of  components. 

Figure  6.2  illustrates  a  zoomed-in  portion  of  the  GPR  data  shown  in  Figure  6.1 
to  illustrate  the  context  sampling  interval  and  how  the  background  features  are 
illustrative  of  contextual  transitions.  The  top  panel  illustrates  the  portion  of  the 
lane  where  the  early-time  subsurface  layer  appears  around  downtrack  sample  4275. 
The  dashed  lines  represent  the  downtrack  samples  from  where  context  features  were 
extracted  from  the  background.  In  the  bottom  plot,  the  contextual  shift  is  reflected 
by  a  change  in  the  values  of  the  second  principal  component  of  the  context  features. 
It appears that there is a latency of about 30 samples between where the shift occurs
and where the feature values change. This is likely due to the fact that features are
extracted  causally,  using  the  100  A-scans  preceding  each  sample  point. 

After  the  feature  sequence  is  extracted  from  the  background,  it  is  processed  by  a 
statistical  context  model.  The  context  model  yields  posterior  context  probabilities, 
$p(c_{nm} = 1 \mid x_n^{(c)})$, for each sample $x_n^{(c)}$, for $n = 1, 2, \ldots, N$. If a prescreener alarm falls
between  two  samples,  it  is  associated  with  the  context  posterior  of  the  earlier  sample. 
Context  posteriors  for  several  distinct  test  lanes  are  illustrated  in  the  experimental 
results  presented  in  Section  6.4.1. 
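
The rule for associating a prescreener alarm with the context posterior of the earlier sample can be sketched as a sorted lookup. The function name, the use of downtrack position in meters, and the toy posteriors are hypothetical. Treating an alarm that falls exactly on a sample as belonging to that sample is also an assumption, since the text only covers alarms that fall between samples.

```python
from bisect import bisect_right

def context_for_alarm(alarm_position_m, sample_positions_m, posteriors):
    """Return the context posterior of the latest sample at or before the alarm.

    sample_positions_m must be sorted in increasing downtrack order.
    """
    idx = bisect_right(sample_positions_m, alarm_position_m) - 1
    if idx < 0:
        raise ValueError("alarm precedes the first context sample")
    return posteriors[idx]

# Samples every 10 cm; posteriors over two hypothetical contexts.
positions = [0.0, 0.1, 0.2, 0.3]
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
between = context_for_alarm(0.25, positions, posteriors)
on_sample = context_for_alarm(0.2, positions, posteriors)
```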

In  this  chapter,  two  spatial  context  models  are  proposed.  Both  models  were 
learned  using  the  generative  approach.  The  first  is  an  extension  of  the  DPGMM 
originally presented in Chapter 4. The second is based upon the SBHMM, originally
developed by Paisley and Carin [69]. While the DPGMM approach to context
modeling treats all samples as independent observations, the SBHMM exploits dependencies
between neighboring samples. The two context models are described in

[Plot title: GPR Data: Concrete Lane (Zoomed In)]

Figure 6.2: GPR data from Figure 6.1, zoomed in to illustrate background sampling
near a contextual shift. Top: raw GPR data, with feature extraction locations noted
by dashed lines, Bottom: 3-D PCA of background contextual features. Downtrack
position  is  represented  by  the  horizontal  axis  in  both  panels. 


greater  detail  in  the  following  sections. 

6.2  DPGMM  Spatial  Context  Model 

The  DPGMM  was  proposed  in  Chapter  4  for  generative  context  learning,  and  was 
also  utilized  in  Chapter  5  as  the  gating  network  for  discriminative  context  learning. 
In  both  cases,  the  DPGMM  was  used  to  model  the  distribution  of  context  features 
corresponding  to  prescreener  alarms.  In  the  case  of  spatial  context  modeling,  the 
DPGMM  was  trained  on  the  three-dimensional  PCA  projection  of  the  contextual 
feature sequence ($X^{(c)}$) extracted from regular background samples.

Refer to Section 4.3 for a description of the DPGMM generative model and likelihood
functions. Details on VB inference of the model parameters that was developed
for this application can be found in Appendix C. The DPGMM was learned using
the same hyperparameter settings as in Chapter 4: $u_0 = 1$, $\tau_{10} = \tau_{20} = 1$, $\nu_0 = D^{(c)}$,
$B_0 = I_{D^{(c)}}$, and $\rho_0$ was set equal to the sample mean of $X^{(c)}$. Note that since
PCA is being used on the features, $D^{(c)} = 3$. Additionally, the truncation level $T$
was set to 30. The only major difference in implementation was the cluster pruning
criterion, which was set at 5% to prevent too many small contexts from being
learned. 
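
One plausible reading of the pruning criterion is sketched below: mixture components whose weight falls below the threshold are discarded and the remaining weights are renormalized. Interpreting the 5% figure as a threshold on the mixing proportion is an assumption, as are the function name and the toy weights; the actual criterion may be defined differently (for example, on the fraction of samples assigned to a cluster).

```python
def prune_components(weights, threshold=0.05):
    """Drop components whose weight is below threshold and renormalize.

    Returns (kept_indices, renormalized_weights).
    """
    kept = [k for k, w in enumerate(weights) if w >= threshold]
    total = sum(weights[k] for k in kept)
    return kept, [weights[k] / total for k in kept]

# Hypothetical mixing proportions for five components.
kept, pruned_weights = prune_components([0.50, 0.30, 0.16, 0.03, 0.01])
```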

Recall  the  DPGMM  likelihood  function  given  by  (4.27)  and  its  data-generating 
process;  by  modeling  context  as  the  latent  variable  governing  draws  from  a  mixture 
of  Gaussians,  observations  are  treated  as  statistically  independent.  In  terms  of  the 
Chinese  restaurant  process,  which  was  described  in  Section  4.2,  each  customer  selects 
a  table  based  only  on  the  number  of  people  seated  at  each  table  and  not  necessarily 
what  the  previous  customer’s  choice  was. 
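
The independence of successive seating choices can be seen in a short simulation of the Chinese restaurant process. In the sketch below, the probability of joining a table depends only on the current occupancy counts and the concentration parameter; the previous customer's choice enters only through those counts. The function name and the parameter values are illustrative.

```python
import random

def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table assignments; returns the list of table occupancy counts."""
    rng = random.Random(seed)
    counts = []
    for n in range(n_customers):
        # With n customers already seated, existing table k is chosen with
        # probability counts[k] / (n + alpha), and a new table with
        # probability alpha / (n + alpha).
        r = rng.uniform(0.0, n + alpha)
        cumulative = 0.0
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                counts[k] += 1
                break
        else:
            counts.append(1)
    return counts

table_counts = chinese_restaurant_process(n_customers=200, alpha=1.0)
```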

However,  it  was  mentioned  earlier  that  certain  contextual  factors  may  be  spatially- 
distributed.  The  spatial  dependency  between  feature  samples  may  be  a  useful  source 
of  contextual  information.  The  following  section  proposes  using  a  nonparametric 
variant  of  the  HMM  to  model  context  as  a  spatially-varying  state  underlying  the 
background features $X^{(c)}$.

6.3  SBHMM  Spatial  Context  Model 

The HMM is a popular choice for modeling time series that are dependent on an underlying
state variable that is not directly observed, but can be inferred from data.
While most notably used in speech recognition applications [111], HMMs have also
been explored for modeling polyphonic music recordings [112,113], speaker diarization
[114], handwriting recognition [115,116], acoustic sensing [117], and landmine
detection  [42,48]. 

The HMM follows the structure of a Markov chain, in which a data sequence
$X = [x_1, x_2, \ldots, x_N]$ is assumed to be in one of $M$ states at a given index $n$, i.e.,
$s_n \in \{S_1, S_2, \ldots, S_M\}$, where $M$ is the order of the model. The HMM incorporates
a degree of statistical dependency between observations in the sequence through
the Markov property, which states that the state of the current observation is only
dependent on the state of the previous observation:

$$p(s_{n+1} = S_j \mid s_n = S_m, s_{n-1} = S_k, \ldots, s_1 = S_l) = p(s_{n+1} = S_j \mid s_n = S_m). \tag{6.1}$$

The “hidden” aspect of an HMM is that the underlying state is treated as an
unknown latent variable. However, the state sequence can be inferred from $X$ given
the model parameters, $\{\pi, A, \Theta\}$. The $M \times 1$ vector, $\pi$, consists of the initial state
probabilities, which are given by

$$\pi_m = p(s_1 = S_m), \quad m = 1, 2, \ldots, M, \tag{6.2}$$

and satisfy the following properties:

$$0 \le \pi_m \le 1, \tag{6.3}$$

$$\sum_{m=1}^{M} \pi_m = 1. \tag{6.4}$$

The $M \times M$ matrix, $A$, consists of the state transition probabilities, which are given
by

$$a_{mj} = p(s_{n+1} = S_j \mid s_n = S_m), \quad m = 1, 2, \ldots, M, \quad j = 1, 2, \ldots, M, \tag{6.5}$$

are assumed to be constant with respect to time, and satisfy the following properties:

$$0 \le a_{mj} \le 1, \tag{6.6}$$

$$\sum_{j=1}^{M} a_{mj} = 1. \tag{6.7}$$

Finally, $\Theta$ denotes the parameters for the emission densities, $p(x_n \mid s_n = S_m)$. There is
no  restriction  on  the  form  of  the  emission  densities,  the  only  requirement  being  that 
they  are  valid  PDFs. 
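
The constraints on the initial state probabilities and the transition matrix, and the way a state sequence is generated under the Markov property, can be sketched as follows. The two-state parameter values are hypothetical, and the code uses zero-based state indices with row `m` of `A` holding the probabilities of moving from state `m` to each next state, so that each row sums to one.

```python
import random

def is_stochastic_vector(p, tol=1e-9):
    """True if every entry lies in [0, 1] and the entries sum to one."""
    return all(0.0 <= x <= 1.0 for x in p) and abs(sum(p) - 1.0) < tol

def validate_hmm(pi, A):
    """Check the constraints on pi and on each row of A."""
    return is_stochastic_vector(pi) and all(is_stochastic_vector(row) for row in A)

def sample_states(pi, A, length, seed=0):
    """Draw a state sequence; each state depends only on its predecessor."""
    rng = random.Random(seed)
    states = rng.choices(range(len(pi)), weights=pi)
    while len(states) < length:
        states += rng.choices(range(len(pi)), weights=A[states[-1]])
    return states

# Hypothetical two-state model with "sticky" transitions, so that the
# state tends to persist across neighboring samples.
pi = [0.6, 0.4]
A = [[0.95, 0.05],
     [0.10, 0.90]]
sequence = sample_states(pi, A, length=50)
```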

The  conventional  method  for  learning  the  parameters  of  an  HMM  is  via  the  Baum- 
Welch  algorithm,  which  performs  maximum-likelihood  estimation  for  a  model  with 
fixed order (i.e., known number of states). Given a trained HMM, the Viterbi algorithm
can then be used for calculating the most probable state sequence for a given
observation sequence. Details regarding the Baum-Welch and Viterbi algorithms can
be  found  in  [111],  while  implementation  details  are  discussed  in  [118]. 
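
As a rough illustration of the decoding step, the following Python sketch implements the Viterbi recursion in the log domain for a generic HMM. The function name `viterbi`, the helper `safe_log`, and the argument layout (per-sample emission log-likelihoods precomputed by the caller) are assumptions made for this example; they are not taken from [111] or [118], and the toy probabilities are purely illustrative.

```python
import math

def viterbi(log_pi, log_A, log_emis):
    """Most probable state path for one sequence.

    log_pi[m]      : log initial probability of state m
    log_A[m][j]    : log probability of moving from state m to state j
    log_emis[n][m] : log p(x_n | s_n = S_m), precomputed by the caller
    """
    M, N = len(log_pi), len(log_emis)
    delta = [log_pi[m] + log_emis[0][m] for m in range(M)]
    backptr = []
    for n in range(1, N):
        new_delta, ptr = [], []
        for j in range(M):
            # Best predecessor of state j at step n.
            best_m = max(range(M), key=lambda m: delta[m] + log_A[m][j])
            ptr.append(best_m)
            new_delta.append(delta[best_m] + log_A[best_m][j] + log_emis[n][j])
        delta = new_delta
        backptr.append(ptr)
    # Trace the best path backwards from the most likely final state.
    state = max(range(M), key=lambda m: delta[m])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def safe_log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

# Toy two-state example (values are illustrative only).
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
emis = [[0.8, 0.2], [0.7, 0.3], [0.05, 0.95], [0.2, 0.8]]  # p(x_n | state)
path = viterbi([safe_log(p) for p in pi],
               [[safe_log(a) for a in row] for row in A],
               [[safe_log(e) for e in row] for row in emis])
```

Working in the log domain avoids numerical underflow on long sequences, which is the usual reason for preferring sums of logarithms over products of probabilities.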

Earlier  work  suggested  modeling  context  in  GPR  data  as  a  spatially-dependent 
state  variable  using  an  HMM  of  fixed  order  [119].  While  a  spatially-dependent 
HMM  context  model  showed  potential  for  improvement  over  alarm-based  context- 
dependent  fusion,  performance  varied  significantly  with  respect  to  the  number  of 
states  (contexts)  being  considered.  Like  GMMs,  HMMs  are  susceptible  to  over-  or 
under-training  if  the  model  order  is  specified  incorrectly.  A  poorly-trained  context 
model  can  then  lead  to  poor  performance  in  context-dependent  fusion.  Fortunately, 
the  DP  offers  a  potential  solution  to  this  problem  as  it  did  in  the  case  of  the  GMM 
context  model. 

Since the elements of $\boldsymbol{\pi}$ and the rows of $\mathbf{A}$ are constrained to sum to one, they can
be treated as parameters of a multinomial distribution from which the underlying
state is drawn at any given point in the sequence, $X$. If an HMM is assumed to be
infinite-order, the DP can be used as a sparseness-promoting prior on the number of
states since it is conjugate to the multinomial distributions parameterized by $\boldsymbol{\pi}$ and
$\mathbf{A}$. Several methods for incorporating DP priors into HMM inference rely on Markov
chain Monte Carlo (MCMC) sampling to approximate the posterior probabilities
[114,120]. To maintain consistency with the VB techniques used in the previous
chapters, this work utilizes the VB approach based on the stick-breaking construction
as proposed in [69]. This model is referred to as the stick-breaking HMM (SBHMM).

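The SBHMM imposes a stick-breaking prior on the rows of $\mathbf{A}$ as well as on $\boldsymbol{\pi}$
to facilitate learning an effective number of states given the training sequences. The
priors are given by:

$$a_{mj} = v_{mj}^{A} \prod_{k=1}^{j-1} \left(1 - v_{mk}^{A}\right) \tag{6.8}$$

$$\pi_m = v_m^{\pi} \prod_{k=1}^{m-1} \left(1 - v_k^{\pi}\right) \tag{6.9}$$

where

$$v_{mj}^{A} \sim \text{Beta}(1, \alpha_{mj}^{A}) \tag{6.10}$$

$$v_m^{\pi} \sim \text{Beta}(1, \alpha_m^{\pi}) \tag{6.11}$$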
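
To see why this prior favors a small number of states, consider a simplified case, assumed here only for illustration, in which every stick shares one fixed concentration $\alpha$ instead of the separate values in (6.10) and (6.11). The $v_k$ are then independent with $\mathbb{E}[v_k] = 1/(1+\alpha)$, and the expected weights decay geometrically:

```latex
\mathbb{E}[\pi_m]
  = \mathbb{E}[v_m] \prod_{k=1}^{m-1} \mathbb{E}[1 - v_k]
  = \frac{1}{1+\alpha} \left( \frac{\alpha}{1+\alpha} \right)^{m-1},
\qquad
\sum_{m=1}^{T} \mathbb{E}[\pi_m] = 1 - \left( \frac{\alpha}{1+\alpha} \right)^{T}.
```

For example, with $\alpha = 1$ the first five states carry an expected total weight of $1 - (1/2)^5 \approx 0.97$, so most of the probability mass concentrates on a handful of states even though the prior nominally allows infinitely many.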

Recall the discussion regarding stick-breaking priors from Section 4.2. If the
distribution $G$ is drawn from a stick-breaking process, model parameters drawn from
$G$ will take on distinct values $\theta_j^*$, $j = 1, 2, \ldots, \infty$, and $G$ therefore translates to
the discrete density given by (4.26). In the case of the HMM, the latent variable
governing the mixture proportions corresponds to the underlying state. Therefore,
imposing the stick-breaking prior on the state transition probabilities assumes that
$G$ is state-dependent and each state shares the same $\theta_j^*$, such that

$$G_n^{(m)} = \begin{cases} \sum_{j=1}^{\infty} \pi_j \, \delta_{\theta_j^*}, & \text{if } n = 1 \\ \sum_{j=1}^{\infty} a_{mj} \, \delta_{\theta_j^*}, & \text{if } n > 1 \end{cases} \qquad \forall m = 1, 2, \ldots, \infty \tag{6.12}$$


VB inference can be performed on the SBHMM by assuming a truncation level
$T$ on the number of states and conjugate priors on all model parameters, including
the parameters of the emission densities. In this work, the emission densities were
treated as multivariate Gaussian with unknown mean and covariance, and therefore
have Normal-Wishart priors. As a result, the SBHMM used as a context model in
this work is very similar to the DPGMM proposed in the previous section, with
the Markov property being the only major difference between the two. The
data-generating process for the SBHMM used in this work is as follows:


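1. For $m = 1, 2, \ldots, T$

   (a) Draw $\alpha_m^{\pi} \sim \text{Gamma}(c_0, d_0)$

   (b) Draw $v_m^{\pi} \sim \text{Beta}(1, \alpha_m^{\pi})$

   (c) Calculate initial state probabilities $\pi_m = v_m^{\pi} \prod_{k=1}^{m-1} (1 - v_k^{\pi})$

   (d) For $j = 1, 2, \ldots, T$

       i. Draw $\alpha_{mj}^{A} \sim \text{Gamma}(c_0, d_0)$

       ii. Draw $v_{mj}^{A} \sim \text{Beta}(1, \alpha_{mj}^{A})$

       iii. Calculate state transition probabilities $a_{mj} = v_{mj}^{A} \prod_{k=1}^{j-1} (1 - v_{mk}^{A})$

   (e) Draw $\theta_m^* \mid G_0 \sim \mathcal{N}(\boldsymbol{\mu}_m \mid \boldsymbol{\mu}_0, (u_0 \boldsymbol{\Lambda}_m)^{-1}) \, \mathcal{W}(\boldsymbol{\Lambda}_m \mid \mathbf{B}_0, \nu_0)$

2. For $n = 1, 2, \ldots, N$

   (a) Draw indicator variable $s_n \sim \begin{cases} \text{Multinomial}(\boldsymbol{\pi}), & \text{if } n = 1 \\ \text{Multinomial}(\mathbf{a}_{s_{n-1}}), & \text{if } n > 1 \end{cases}$

   (b) Draw data $\mathbf{x}_n^{(C)} \mid s_{nm} = 1 \sim \mathcal{N}(\mathbf{x}_n^{(C)} \mid \theta_m^*)$, $m = 1, 2, \ldots, M$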
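
Steps 1(a) through 1(d) of the process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, Gamma($c_0$, $d_0$) is read as shape $c_0$ and rate $d_0$, and the hyperparameter values are chosen for readability rather than taken from the experiments.

```python
import random

def stick_breaking(alphas, rng):
    """Truncated stick-breaking weights with v_k ~ Beta(1, alpha_k)."""
    weights, remaining = [], 1.0
    for alpha in alphas:
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)   # v_k * prod_{l<k} (1 - v_l)
        remaining *= 1.0 - v
    # `remaining` is the mass left beyond the truncation level; it is
    # returned rather than reassigned to the last stick.
    return weights, remaining

def draw_sbhmm_weights(T, c0, d0, rng):
    """Draw pi and the rows of A for a truncation level T."""
    def gamma_draws():
        # Assumption: Gamma(c0, d0) means shape c0 and rate d0, so the
        # scale argument expected by gammavariate is 1 / d0.
        return [rng.gammavariate(c0, 1.0 / d0) for _ in range(T)]
    pi, _ = stick_breaking(gamma_draws(), rng)
    A = [stick_breaking(gamma_draws(), rng)[0] for _ in range(T)]
    return pi, A

rng = random.Random(0)
# Illustrative hyperparameters, not the values used in the experiments.
pi, A = draw_sbhmm_weights(T=10, c0=1.0, d0=1.0, rng=rng)
```

Because the sticks are truncated at $T$, the drawn weights sum to slightly less than one; the leftover mass shrinks quickly as $T$ grows, which is what makes a finite truncation level workable.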

The SBHMM emission densities were initialized into $T = 30$ clusters using
k-means. The following hyperparameter settings were used for all experiments in this
chapter, as recommended in [69]: $u_0 = 1$, $c_0 = 10^{-6}$, $d_0 = 0.1$, $\nu_0 = D^{(C)}$, B0 =
-D^Ioco, and $\boldsymbol{\mu}_0$ was set equal to the sample mean of $X^{(C)}$.

The following example illustrates the performance of the SBHMM in modeling
synthetic data. Figure 6.3 illustrates the parameters of a four-state
HMM from which 100 sequences of length 25 were drawn. The emission densities are
four Gaussian distributions with means of -12, -4, 4, and 12 with unit variance. The
initial state probabilities are uniform, i.e., $\pi_m = 0.25$ for $m = 1, 2, 3, 4$. Finally, the
state transition matrix shows no probability of remaining in any given state: each
state has equal probability of transitioning to one of two other states.

Figure 6.3: True parameters for the SBHMM synthetic data example. The top
panel illustrates the emission densities, the bottom-left panel illustrates the initial
state probabilities, and the bottom-right panel illustrates the state transition
probability matrix.
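
The synthetic setup described above can be reproduced approximately with the short Python sketch below. The means, unit variance, uniform initial probabilities, sequence count, and sequence length come from the text; the specific pairs of destination states in the transition matrix are an assumption, since the text only states that each state moves to one of two other states with equal probability.

```python
import random

MEANS = [-12.0, -4.0, 4.0, 12.0]   # unit-variance Gaussian emissions
PI = [0.25, 0.25, 0.25, 0.25]      # uniform initial state probabilities
# Assumed transition pattern: no self-transitions, and each state moves to
# one of two other states with probability 0.5. The actual pairs shown in
# Figure 6.3 are not recoverable from the text.
A = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.5, 0.0, 0.0],
]

def sample_sequence(length, rng):
    """Draw one (states, observations) pair from the HMM defined above."""
    states, obs = [], []
    state = rng.choices(range(4), weights=PI)[0]
    for _ in range(length):
        states.append(state)
        obs.append(rng.gauss(MEANS[state], 1.0))
        state = rng.choices(range(4), weights=A[state])[0]
    return states, obs

rng = random.Random(0)
sequences = [sample_sequence(25, rng) for _ in range(100)]
```

Since the means are separated by eight standard deviations, the emission densities barely overlap, which makes this an easy case for checking that a learned model recovers the correct number of states.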

The SBHMM was learned using VB inference with an NFE convergence
threshold of $10^{-4}$. After convergence, states with too few samples were eliminated. The
expected number of state transitions (denoted as $\hat{\mathbf{A}} = \{\hat{a}_{mj}\}$) was calculated from
the variational posteriors on $\mathbf{A}$ to yield the top panel of Figure 6.4. To calculate
the expected overall state occupancy, the columns of $\hat{\mathbf{A}}$ were summed to yield the
values shown in the bottom panel of Figure 6.4. All states with an occupancy of
less than 1% were pruned from the model, and the remaining initial and transition
probabilities were renormalized to sum to one.
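
The pruning rule described above can be sketched as follows. This is a hedged illustration rather than the code used in this work: the function name `prune_states` is hypothetical, occupancy is taken as each column sum divided by the total expected count, and the pruned transition matrix is rebuilt by normalizing the retained expected counts row by row (which assumes every retained state has some outgoing count).

```python
def prune_states(trans_counts, pi, threshold=0.01):
    """Drop low-occupancy states and renormalize pi and A.

    trans_counts[m][j] : expected number of m -> j transitions
    pi[m]              : initial state probabilities before pruning
    """
    T = len(trans_counts)
    # Column sums: expected number of times each state is entered.
    col_sums = [sum(trans_counts[m][j] for m in range(T)) for j in range(T)]
    total = sum(col_sums)
    occupancy = [c / total for c in col_sums]
    keep = [j for j in range(T) if occupancy[j] >= threshold]

    # Renormalize the initial probabilities over the retained states.
    pi_kept = [pi[j] for j in keep]
    pi_new = [p / sum(pi_kept) for p in pi_kept]
    # Rebuild each retained row of A from its retained counts.
    A_new = []
    for m in keep:
        row = [trans_counts[m][j] for j in keep]
        A_new.append([c / sum(row) for c in row])
    return keep, pi_new, A_new

# Toy counts for a truncation level of 3; the third state is nearly unused.
counts = [[10.0, 30.0, 0.2],
          [35.0, 5.0, 0.1],
          [0.3, 0.2, 0.0]]
keep, pi_new, A_new = prune_states(counts, pi=[0.5, 0.45, 0.05])
```

In this toy case the third state accounts for well under 1% of the expected occupancy, so only the first two states survive and their probabilities are rescaled to sum to one.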

The model parameters which remained after pruning are shown in Figure 6.5.
In this example, all of the true parameters were approximated very closely. By
comparing these results to the true HMM parameters shown in Figure 6.3, it is
clear that the learned State 1 corresponds to the true State 2, the learned State 2
corresponds to the true State 1, the learned State 3 corresponds to the true State 4,
and the learned State 4 corresponds to the true State 3.

Figure 6.4: Illustration of state pruning from the converged SBHMM. The top panel
illustrates the expected number of state transitions as calculated from the variational
posteriors. The bottom panel illustrates the expected state occupancy, with the 1%
occupancy threshold shown.

Figure 6.5: Learned parameters for the SBHMM synthetic data example. The top
panel illustrates the learned emission densities, the bottom-left panel illustrates the
learned initial state probabilities, and the bottom-right panel illustrates the learned
state transition probability matrix.

For use as a GPR context model, an SBHMM was trained on sequences of
PCA-projected background features ($D^{(C)} = 3$) using VB inference with the same
hyperparameter settings as in the synthetic data example. After learning converged to
a solution and extraneous states were pruned with a 5% occupancy criterion, the
causal state posteriors at each downtrack position are given, up to normalization
over the states, by the forward variable, $\alpha$:

$$\alpha_n(m) = p\left(\mathbf{x}_1^{(C)}, \ldots, \mathbf{x}_n^{(C)}, s_{nm} = 1 \mid \boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\Theta}\right), \tag{6.13}$$


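where $\alpha$ is computed recursively via the following:

$$\alpha_1(m) = \pi_m \, p\left(\mathbf{x}_1^{(C)} \mid s_{1m} = 1\right) \tag{6.14}$$

$$\alpha_{n+1}(j) = \left[ \sum_{m=1}^{M} \alpha_n(m) \, a_{mj} \right] p\left(\mathbf{x}_{n+1}^{(C)} \mid s_{n+1,j} = 1\right) \tag{6.15}$$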
The  forward  variable  allows  for  the  context  of  a  given  downtrack  position  to  be 
computed  using  only  the  current  and  prior  samples.  Although  the  Markov  property 
assumes  that  the  state  of  a  given  sample  is  only  dependent  on  the  state  of  the 
previous  sample,  the  recursive  update  allows  for  spatial  dependency  to  be  a  factor 
in  determining  the  context  posterior  of  any  location  in  the  background  sequence. 
When a prescreener alarm is encountered on the lane at location $n$, it is assigned the
context posterior corresponding to the background sample $\mathbf{x}_n^{(C)}$. If the alarm falls
between two background samples, the earlier sample’s context posterior is used.
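
The forward recursion and the alarm-assignment rule described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: emissions are univariate Gaussians rather than the three-dimensional densities used in this work, the forward variable is rescaled to sum to one at every step (a common device that yields the causal posterior directly and avoids underflow), and the function names are hypothetical.

```python
import math

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density (1-D stand-in for the emission model)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def forward_posteriors(obs, pi, A, means, variances):
    """Causal state posteriors p(s_n = m | x_1, ..., x_n) for one sequence."""
    M = len(pi)
    posteriors = []
    for n, x in enumerate(obs):
        lik = [gauss_pdf(x, means[m], variances[m]) for m in range(M)]
        if n == 0:
            alpha = [pi[m] * lik[m] for m in range(M)]
        else:
            prev = posteriors[-1]
            # Predict with the transition matrix, then weight by likelihood.
            alpha = [sum(prev[m] * A[m][j] for m in range(M)) * lik[j]
                     for j in range(M)]
        norm = sum(alpha)
        posteriors.append([a / norm for a in alpha])
    return posteriors

def posterior_for_alarm(alarm_pos, sample_pos, posteriors):
    """Use the latest background sample at or before the alarm position."""
    idx = max(i for i, p in enumerate(sample_pos) if p <= alarm_pos)
    return posteriors[idx]

# Toy two-state example with illustrative values.
post = forward_posteriors([-4.1, -3.8, 4.2], [0.5, 0.5],
                          [[0.9, 0.1], [0.1, 0.9]], [-4.0, 4.0], [1.0, 1.0])
alarm_post = posterior_for_alarm(1.5, [0.0, 1.0, 2.0], post)
```

Because each step uses only the previous posterior and the current observation, the context at a given downtrack position can be computed online as the vehicle advances.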

As in the previous approaches, the context posteriors were used in training an
ensemble of RVMs for context-dependent algorithm fusion on the prescreener, EHD,
HMM, and SPSCF algorithm confidences. The RVMs were trained on each of the
alarms’ target features $\mathbf{x}_n^{(T)}$ using the mixture-of-RVMs approach described in
Appendix B. For a test alarm at location $n$, each of the RVMs will yield a within-context
target posterior, $p(H_1 \mid \mathbf{x}_n^{(T)}, s_{nm} = 1)$. The forward variable, $\alpha_n(m)$, for that location
is then calculated using the learned HMM parameters for $m = 1, 2, \ldots, M$ and
normalized over $m$ so that it can serve as the context posterior. Finally, a
posterior confidence for the alarm can then be calculated by

$$\begin{aligned} p\left(H_1 \mid \mathbf{x}_n^{(T)}\right) &= \sum_{m=1}^{M} p\left(H_1 \mid \mathbf{x}_n^{(T)}, s_{nm} = 1\right) p\left(s_{nm} = 1 \mid \mathbf{x}_1^{(C)}, \ldots, \mathbf{x}_n^{(C)}, \boldsymbol{\pi}, \mathbf{A}, \boldsymbol{\Theta}\right) \\ &= \sum_{m=1}^{M} p\left(H_1 \mid \mathbf{x}_n^{(T)}, s_{nm} = 1\right) \alpha_n(m) \end{aligned} \tag{6.16}$$
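
The fusion rule in (6.16) amounts to a weighted average of the within-context target posteriors. The Python sketch below illustrates this under stated assumptions: each RVM is represented only by a weight vector and a bias, a logistic link is used to map the linear discriminant to a probability (the link used by the RVMs in Appendix B may differ), and all numerical values are hypothetical.

```python
import math

def rvm_posterior(weights, bias, features):
    """Within-context target posterior from a linear discriminant.

    Assumption: a logistic link is used here purely for illustration.
    """
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def fused_confidence(features, context_post, rvms):
    """Mix the per-context posteriors using the context posterior as weights."""
    return sum(
        alpha_m * rvm_posterior(w, b, features)
        for alpha_m, (w, b) in zip(context_post, rvms)
    )

# Hypothetical confidences from four algorithms and two contexts.
features = [0.8, 0.3, 0.6, 0.1]
rvms = [([2.0, 0.0, 1.0, 0.0], -1.0),    # context 1: (weights, bias)
        ([-1.0, 1.5, 0.0, 0.5], 0.0)]    # context 2
confidence = fused_confidence(features, [0.7, 0.3], rvms)
```

When the context posterior places all of its mass on one context, the fused confidence reduces to the output of that context's RVM alone; otherwise the RVM outputs are blended in proportion to how likely each context is.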

6.4  Experimental  Results 

An  experiment  was  performed  using  a  subset  of  the  GPR  data  set  that  was  used  in 
previous  chapters.  A  smaller  dataset  was  used  because  the  full  data  was  too  large 
for  efficiently  training  the  spatial  context  models  with  fine  downtrack  sampling.  The 
data  under  consideration  in  this  experiment  was  collected  at  an  Eastern  US  test 
site  under  dry  conditions  in  March  2009.  Four  test  lanes  (dirt,  gravel,  asphalt,  and 
concrete)  were  present  at  the  site.  The  target  population  consisted  of  10  types  of 
AT landmines plus 155 mm artillery shells. Empty holes were present and scored as
clutter. Overall, a total of 764 targets and 152 clutter objects were encountered over
a total collection area of 12,383 m². The distribution of prescreener alarms with
respect to the four lanes is summarized in Table 6.1.

Evaluation of alarm classification was performed using the same object-based
cross-validation technique used in the previous experiments. However, the spatial
DPGMM and SBHMM were trained outside of cross-validation since they utilized
background feature sequences instead of prescreener alarms. The following
subsections provide analysis of context-dependent fusion, including the performance of the
context models, the context-specific RVMs, and overall discrimination performance.

Table 6.1: Alarm Distribution by Soil Type and Ground Truth (Smaller Data Set)

Soil        Clutter (%)      Targets (%)     Total (%)
Dirt        387 (24.2%)      207 (25.8%)     594 (24.7%)
Gravel      350 (21.8%)      205 (25.6%)     555 (23.1%)
Asphalt     245 (15.3%)      212 (26.4%)     457 (19.0%)
Concrete    620 (38.7%)      178 (22.2%)     798 (33.2%)
ALL         1,602 (100%)     802 (100%)      2,404 (100%)

6.4.1 Context Modeling Performance

Several unique realizations of the DPGMM and SBHMM context models were
obtained through random k-means initializations. For purposes of comparison, we
consider the case in which both models yielded seven contexts. Figure 6.6 compares
the means of the context distributions that were learned from the spatial DPGMM
and SBHMM models. Other than DPGMM Context 2, the means of the context
distributions are very similar in both cases. In addition, the learned covariance
matrices of the DPGMM context distributions are shown in Figure 6.7, and the learned
covariance matrices of the SBHMM context distributions are shown in Figure 6.8.
The covariance matrices appear to be less similar than the means, but the overall
scale and structure of each context’s covariance matrix appear similar between the
two models. Comparing the Gaussian densities learned for both models therefore
suggests that the spatial dependency leveraged by the SBHMM has more of an impact
on the learned emission covariances than on the means.

The initial state probabilities learned for the SBHMM are plotted in Figure 6.9,
and the state transition probability matrix is shown in Figure 6.10. The initial state
probabilities appear relatively uniform, with States 1, 3, and 5 having an initial
probability close to 0.2 and States 2, 4, 6, and 7 having initial probabilities close to
0.1. The state transition matrix has a moderate diagonal, but the probabilities of
remaining in one state are not as high as what would be expected. This result was
somewhat surprising, since the test lanes over which data were collected were
artificially constructed and short in length, so they were expected to be relatively
homogeneous.

Figure 6.6: Learned context means for the spatial DPGMM and SBHMM context
models on GPR data. Left: means learned from the spatial DPGMM; Right: means
learned from the SBHMM. Feature dimension is represented by the horizontal axis.

Figure 6.7: Covariance matrices of clusters learned by the spatial DPGMM
context model. Each panel represents the covariance matrix of the Student-t mixture
components obtained by integrating over the DPGMM parameters.

Figures 6.11-6.14 illustrate examples of the raw data, background contextual
features (projected to three dimensions via PCA), and the state posteriors for both
the SBHMM and DPGMM context models for single passes down each of the four lanes.
Figure 6.11 corresponds to the dirt lane. In this case, the SBHMM assigned high posterior
probability of being in Context 7 for most of the lane, while the DPGMM assigned
higher probability to either Context 1 or 2, and lower probability to Contexts 5 and
6. The gravel lane is shown in Figure 6.12, where the DPGMM and SBHMM context
models appeared to behave somewhat similarly.

Figure 6.8: Covariance matrices of clusters learned by the SBHMM context model.
Each panel represents the covariance matrix of the Gaussian emission density
corresponding to each context.

Figure 6.9: Initial state probabilities learned by the SBHMM context model. State
(context) is represented by the horizontal axis.

Figure 6.10: State transition probabilities learned by the SBHMM context model.
State (context) is represented by the horizontal and vertical axes.

Figure  6.13  shows  the  results  of  spatial  context  modeling  for  the  asphalt  lane. 
The  SBHMM  assigned  high  posterior  probability  of  being  in  Context  1,  4,  or  7  at 
any given position. Meanwhile, the DPGMM context posteriors appear to be a more
“smoothed-over” version of the SBHMM context posterior, assigning moderate
probability to multiple contexts. Comparing the two models here shows a great similarity
in  where  the  contextual  changes  occurred  in  the  lane.  However,  the  SBHMM  yielded 
sharp  state  transitions  while  the  DPGMM  favored  gradual  transitions. 

Finally, the context posteriors for the concrete lane that was originally shown in
Figures 6.1 and 6.2 are shown in Figure 6.14. A similar effect to what was seen on
the asphalt lane is shown here, in that the SBHMM assigns posterior probabilities
close to one or zero at each downtrack location, while the DPGMM yields moderate
posteriors at transition points. Furthermore, it also appears that the SBHMM is
utilizing more contexts on this lane. This can be seen in the first 2000 downtrack
samples, where the SBHMM utilizes four contexts and the DPGMM utilizes three,
as well as in the remaining samples where the SBHMM utilizes three contexts and
the DPGMM utilizes two.

Figure 6.11: Example GPR data from the dirt lane and associated state posteriors
from SBHMM and DPGMM context models. Top: GPR B-scan; Center: PCA
of background context features; Bottom: SBHMM and DPGMM state posteriors.
Downtrack position is represented by the horizontal axes.

Figure 6.12: Example GPR data from the gravel lane and associated state
posteriors from SBHMM and DPGMM context models. Top: GPR B-scan; Center: PCA
of background context features; Bottom: SBHMM and DPGMM state posteriors.
Downtrack position is represented by the horizontal axes.

Figure 6.13: Example GPR data from the asphalt lane and associated state
posteriors from SBHMM and DPGMM context models. Top: GPR B-scan; Center: PCA
of background context features; Bottom: SBHMM and DPGMM state posteriors.
Downtrack position is represented by the horizontal axes.

It  should  be  noted  that  the  presence  of  landmine  and  clutter  signatures  in  the 
GPR  could  have  an  effect  on  spatial  context  modeling,  since  their  downtrack  positions 
are  likely  to  be  sampled  for  extracting  contextual  features.  The  presence  of  anomalies 
corresponding  to  targets  or  clutter  could  possibly  be  a  reason  why  the  SBHMM  tends 
to  yield  many  state  transitions  in  sections  where  the  DPGMM  suggests  a  single 
context.  Because  an  anomaly  does  not  appear  similar  to  the  previous  observation 
in  the  feature  sequence,  the  SBHMM  considers  the  anomaly  to  be  evidence  of  a 
state  transition  while  the  DPGMM  considers  it  to  be  more  of  a  statistical  outlier. 
In  previous  work  [119,121],  the  background  data  was  broken  into  segments  between 
target  positions,  and  the  context  model  was  trained  on  these  target-free  sequences. 
During  this  work,  it  was  very  difficult  to  extract  target-free  sections  of  the  lanes 
that  were  long  enough  to  effectively  model  the  underlying  contextual  factors.  It 
would  also  be  impossible  to  train  a  context  model  in  this  manner  using  held  data, 
since  extracting  target-free  sections  requires  ground  truth  for  the  alarms  that  were 
encountered. Therefore, this approach was not used here, although future work should
investigate how to reliably train a spatially-dependent context model in the presence
of known subsurface anomalies.

6.4.2 Context-Dependent Fusion Results

The spatial DPGMM and SBHMM context models assigned a posterior context
probability to locations of prescreener alarms. As in previous chapters, these
context posteriors were used in training context-specific RVMs for linearly fusing the
confidences of the prescreener, EHD, SPSCF, and HMM algorithms. Figures 6.15
and 6.16 illustrate the discriminant weights assigned by the RVMs to the algorithms
in each of the 7 contexts identified by the DPGMM and SBHMM, respectively.

Figure 6.14: Example GPR data from the concrete lane and associated state
posteriors from SBHMM and DPGMM context models. Top: GPR B-scan; Center: PCA
of background context features; Bottom: SBHMM and DPGMM state posteriors.
Downtrack position is represented by the horizontal axes.

A visual comparison of the fusion weights for each context modeling technique
reveals a number of similarities. For both the DPGMM and SBHMM context models,
the RVM assigns positive weight to the prescreener in three contexts, negative weight
in two contexts, and zero weight in two contexts. The DPGMM contexts in which
the prescreener receives negative weight are Contexts 5 and 7, and the SBHMM
contexts are Contexts 1 and 2. The means and covariances for these contexts shown in
Figures 6.6-6.8 suggest that DPGMM Context 5 and SBHMM Context 1 have similar
densities, as do DPGMM Context 7 and SBHMM Context 2. The context posteriors
shown in Figures 6.13 and 6.14 suggest that these contexts represent pavement:
DPGMM Context 5 and SBHMM Context 1 correspond to portions of the asphalt
lane, and DPGMM Context 7 and SBHMM Context 2 correspond to portions of
the concrete lane. The negative fusion weight assigned to the prescreener for these
contexts implies that its confidence should be discounted, perhaps because it flags too
many false alarms due to anomalous responses from the pavement/soil subsurface
layer.
Furthermore,  the  EHD,  SPSCF,  and  HMM  algorithms  receive  fusion  weights 
that are quite similar between the two context modeling approaches. For DPGMM
contexts, the EHD algorithm is relevant in six contexts, while for SBHMM contexts it
is relevant in five. The SPSCF algorithm is relevant in five contexts for both modeling
approaches.  Finally,  the  HMM  receives  nonzero  weight  in  five  DPGMM  contexts  and 
four  SBHMM  contexts.  Although  it  appears  that  algorithms  are  generally  more  often 
relevant  in  DPGMM  contexts  than  in  SBHMM  contexts,  the  values  of  the  weights 
for  each  algorithm  are  similar  between  the  two  context  models. 

6.4.3 Detection Performance

Context-dependent algorithm fusion using the SBHMM and spatial DPGMM
context models was evaluated via ten-fold object-based cross-validation, as were the fusion
techniques presented in previous chapters. In this experiment, the consistency
of the SBHMM and DPGMM context models’ performance was compared by using
multiple random initializations of the VB learning algorithm. Five different
realizations of the DPGMM and SBHMM context models were considered. For sake of
comparison, five realizations of the alarm-based DPGMM context model (proposed
earlier in Chapter 4) were also considered. Only one realization of the global RVM
was necessary, since it did not require random initialization.

Figure 6.15: RVM discriminant weights learned for algorithm fusion in each spatial
DPGMM context. Each stem represents a particular dimension of the target feature
space, the vertical axis represents the weight value, and the individual contexts are
indicated by line color.

Figure 6.16: RVM discriminant weights learned for algorithm fusion in each
SBHMM context. Each stem represents a particular dimension of the target
feature space, the vertical axis represents the weight value, and the individual contexts
are indicated by line color.

The  ROC  curves  for  context-dependent  fusion,  using  the  spatial  DPGMM  (green) 
and  SBHMM  (blue)  context  models  as  well  as  the  alarm-based  DPGMM  (red),  are 
plotted  in  Figure  6.17.  ROC  curves  for  five  realizations  of  each  model  are  shown, 
and  their  average  FARs  at  benchmark  PDs  are  shown  in  the  legend.  Performance  is 
compared  to  the  global  RVM,  which  incorporates  no  contextual  information,  whose 
ROC  is  shown  by  the  dashed  black  line  and  shaded  by  a  90%  confidence  region. 
Performance  is  also  compared  to  the  individual  fused  algorithms,  whose  ROC  curves 
are  shown  by  dotted  lines. 
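
For readers unfamiliar with this style of ROC curve, the Python sketch below shows one way PD and FAR (in false alarms per square meter) could be computed from a list of alarm confidences. It is an illustrative assumption rather than the scoring code used in this work: PD is computed relative to the targets present in the alarm list, tied confidences are not treated specially, and the function names and toy numbers are hypothetical.

```python
def roc_points(confidences, is_target, area_m2):
    """Return (FAR, PD) pairs swept over every decision threshold.

    FAR is the number of false alarms per square meter; PD is the
    fraction of target alarms declared at or above the threshold.
    Assumption: targets missed by the prescreener are ignored.
    """
    n_targets = sum(is_target)
    # Visit alarms from highest to lowest confidence.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    points, hits, false_alarms = [], 0, 0
    for i in order:
        if is_target[i]:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / area_m2, hits / n_targets))
    return points

def far_at_pd(points, pd_level):
    """Smallest FAR at which the PD reaches pd_level."""
    return min(far for far, pd in points if pd >= pd_level)

# Toy example: five alarms over a hypothetical 100 m^2 area.
pts = roc_points([0.9, 0.8, 0.7, 0.4, 0.2], [1, 0, 1, 1, 0], area_m2=100.0)
```

Reporting FAR at a fixed benchmark PD, as done in the legend of the ROC figure, then corresponds to a call such as `far_at_pd(pts, 0.9)`.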

As  in  previous  GPR  experiments,  results  illustrate  that  all  three  methods  for 
context-dependent  fusion  achieved  significantly  better  detection  performance  than 
global  RVM  fusion.  Furthermore,  both  spatial  context  modeling  techniques  showed 
better  fusion  performance  than  the  alarm-based  DPGMM,  with  the  most  significant 
reductions of FAR occurring between PDs of 0 and 0.85. At PD > 0.90, all
approaches operate at similar FARs, although some realizations of the SBHMM result
in better performance.

An interesting result is the difference in consistency between the three
context-dependent fusion methods. Although the alarm-based DPGMM did not achieve
the same level of performance as the spatial context models, the ROC curves for
context-dependent fusion using the alarm-based DPGMM illustrate very consistent
performance. On the other hand, the ROC curves obtained by using spatial context
models appear to be less consistent, with the SBHMM’s performance fluctuating
more than the DPGMM’s. This could be due to several factors, such as poor choice of
hyperparameters, not running VB for long enough, or insufficient context pruning to
obtain a more consistent solution. Regardless of the reason, the spatial DPGMM
is a simpler model that appears to offer similar but more consistent performance, and
would be a better choice for modeling context in this data.

Figure 6.17: ROC curves for context-dependent fusion, using SBHMM (blue)
and DPGMM (green) spatial context models, compared to alarm-based
context-dependent fusion (red), global RVM fusion (black dashed), and the individual fused
algorithms (dotted). The ROC consists of PD versus FAR, measured in false alarms
per square meter, as a function of decision threshold.

6.5 Conclusion


In military route clearance applications for GPR, large stretches of target-free
background data may be recorded on an excursion that could last for many kilometers.
Although subsurface anomalies may not be present, background data collected over
long periods could be a valuable source of contextual information. Therefore, the
concept of context modeling was extended in this chapter, in which two methods
were proposed for modeling context with respect to downtrack position.

The proposed approaches utilized features that were extracted through
background sampling at regular intervals down the test lanes. The resulting feature
sequences were then modeled using either a DPGMM or SBHMM to obtain posterior
context probabilities at each downtrack location. While the DPGMM treated each
sample as statistically independent, the SBHMM assumed a degree of dependency
between neighboring samples. The incorporation of dependency via the SBHMM
was motivated by the fact that many environmental factors, such as soil moisture,
may be correlated spatially.

Experimental  results  illustrated  that  both  spatial  context  modeling  approaches 
were  able  to  provide  an  intuitive  description  of  how  contextual  factors  vary  over  the 
course  of  a  given  area.  Comparisons  between  the  model  parameters  learned  for  the 
DPGMM  and  SBHMM  illustrated  that  the  learned  contexts  had  similar  probability 
density  functions.  Furthermore,  comparisons  of  the  context  posteriors  showed  that 
both  models  generally  agreed  on  where  contextual  transitions  occurred  in  each  lane. 
However,  one  major  difference  was  that  the  DPGMM  favored  gradual  transitions 
while  the  SBHMM  implied  that  sharp  transitions  took  place. 

Evaluation of context-dependent fusion using the two spatial context models was
performed on a subset of the data used in previous chapters. Performance was
compared to context-dependent fusion using the alarm-based DPGMM that was
proposed in Chapter 4. The ROC curves showed that spatial context modeling
provided additional performance benefits over alarm-based context modeling.
However, the performance improvements obtained through the SBHMM were shown to
be less consistent than those obtained from the spatial DPGMM. Therefore, it was
concluded that the spatial DPGMM would be a better choice for spatial context
modeling, since it is a simpler model that was more consistent and yielded similar
detection performance.

7


Applications  to  Hyperspectral  Sensing 


In this chapter, the context-dependent learning framework originally developed for
buried threat detection with GPR is applied to an alternative sensing modality,
hyperspectral imagery (HSI). Airborne hyperspectral sensing is a particularly
attractive option for detecting buried explosive threats, since it allows for greater standoff
distance than GPR and can be used to survey wide areas quickly. Furthermore,
disturbed earth yields a distinctive signature in HSI that can potentially be indicative
of buried threats such as landmines and IEDs.

The  following  sections  provide  background  information  on  HSI  as  well  as  the 
contextual  factors  affecting  detection  of  buried  threats.  Two  techniques  for  extracting 
contextual  features  are  considered.  The  first  technique  utilizes  a  PCA  projection 
of  the  background  spectra,  and  is  useful  in  characterizing  different  times  of  day. 
The  second  technique  is  based  on  spectral  unmixing,  which  involves  finding  the 
spectra  of  the  constituent  materials  present  in  the  scene  and  how  the  abundances  of 
those materials vary between observations. Finally, context-dependent band selection
is used to classify prescreener alarms recorded over a wide area. Three
context-dependent approaches are compared: supervised context learning, nonparametric
generative context learning with the DPGMM, and nonparametric discriminative
context learning with the DPGMM-RVM. As was done previously for experiments
with  GPR  data,  performance  is  compared  to  a  single  RVM  and  several  algorithms 
from  the  past  literature. 

7.1  Hyperspectral  Imagery 

7.1.1  Background 

Hyperspectral sensors collect measurements of spectral radiance from many contiguous
spectral bands. HSI is used in a variety of remote sensing applications, but
system specifications vary widely with application area. For example, one of the most
popular sensors in the research community is the NASA Airborne Visible/Infrared
Imaging Spectrometer (AVIRIS) sensor, which has been collecting images in 224
bands between 400 and 2500 nm for geological, agricultural, and urban mapping
applications [122]. Meanwhile, the Airborne Hyperspectral Imager (AHI), developed
by the University of Hawaii for subsurface and littoral sensing, is constrained to 70
bands in the long-wave infrared (LWIR) spectrum, with wavelengths ranging from 8 to 12
μm [123].

Anomaly detection in HSI is typically performed by estimating background statistics
and using a metric based on the likelihood ratio. An example of this approach
is  the  popular  RX  detector  [124],  which  uses  adaptive  whitening  to  estimate  the 
local  covariance  of  the  background  near  pixels  of  interest.  Examples  of  targets  and 
false  alarms  detected  by  RX  on  a  hyperspectral  data  set  collected  over  a  minefield 
at  an  arid  site  in  the  Western  US  are  shown  in  Figures  7.1  and  7.2.  Each  figure 
displays  a  series  of  several  15x15  image  chips,  centered  around  detected  anomalies. 
For  visualization  purposes,  the  chips  were  averaged  over  the  70  spectral  bands. 
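
As a rough illustration of the likelihood-ratio style scoring described above, the following sketch computes a global RX score, i.e. the squared Mahalanobis distance of each pixel from the background mean. It is a simplification under stated assumptions: the background statistics are estimated once over the whole image rather than in a sliding local window, and the function name `rx_scores`, the toy data cube, and the small regularization term `eps` are illustrative choices that do not come from [124].

```python
import numpy as np

def rx_scores(cube, eps=1e-6):
    """Global RX anomaly score for a (rows, cols, bands) data cube."""
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands).astype(float)
    centered = pixels - pixels.mean(axis=0)
    # Sample covariance of the background (all pixels, by assumption)
    cov = centered.T @ centered / (pixels.shape[0] - 1)
    cov += eps * np.eye(bands)  # assumed regularization for numerical stability
    # Squared Mahalanobis distance, using a linear solve instead of an inverse
    whitened = np.linalg.solve(cov, centered.T).T
    scores = np.einsum("ij,ij->i", centered, whitened)
    return scores.reshape(rows, cols)

# Toy example: Gaussian background with one implanted anomaly
rng = np.random.default_rng(0)
cube = rng.normal(size=(32, 32, 8))
cube[16, 16] += 6.0
scores = rx_scores(cube)
peak = np.unravel_index(np.argmax(scores), scores.shape)
```

In this toy case the highest score falls on the implanted pixel. A local version would repeat the same computation with the mean and covariance estimated from a window around each pixel of interest.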

FIGURE 7.1: Example HSI image chips corresponding to antitank landmines
recorded by the RX detector over a minefield located at an arid Western US test
site.

The surface and volume scattering of recently-disturbed earth at the target location
often yields a peak intensity called the reststrahlen effect [125]. In Figure 7.1,
the reststrahlen effect is evidenced by a red peak at the center of many of the image
chips.  Meanwhile,  most  false  alarms  shown  in  Figure  7.2  do  not  exhibit  the  same 
type  of  signature.  Because  the  reststrahlen  signatures  are  confined  to  a  small  local 
area,  the  RX  detector  provides  a  good  method  for  detecting  these  types  of  anomalies. 
However,  the  substantial  false  alarm  rate  has  relegated  its  use  to  prescreening  in  past 
experiments with the AHI sensor [65,126].

7.1.2  Environmental  Effects  on  HSI  Sensing 

FIGURE 7.2: Example HSI image chips corresponding to false alarms recorded by
the RX detector over a minefield located at an arid Western US test site.

Because HSI measures spectral radiance over a wide spectral range that can include
visible and/or infrared (IR) portions of the spectrum, several environmental factors
can potentially affect the data. Occlusions such as clouds or heavy smog may reduce
the line-of-sight visibility of objects from the sensor’s position, which may impact
measurements taken in the visible part of the spectrum [127]. For HSI collected in
the  IR  spectrum,  ambient  solar  radiance  and  temperature  are  important  contextual 
factors  since  they  affect  the  thermal  emissions  of  the  ground  and  objects  that  lie  on 
the  surface  [123,126].  Figure  7.3  illustrates  example  spectra  of  three  landmine  targets 
(i.e.  buried/surface  AT  landmines  and/or  disturbed  earth)  and  false  alarms  (i.e. 
bare  soil  and/or  vegetation)  collected  at  three  times  of  day:  morning,  afternoon,  and 
night.  Note  the  difference  in  magnitude  between  spectra  collected  at  each  time.  The 
afternoon  spectra,  collected  after  the  ground  has  absorbed  much  solar  radiation,  are 
of  the  highest  magnitude.  Meanwhile,  the  night  spectra  have  the  lowest  magnitude. 

FIGURE 7.3: Example target and false alarm spectra from HSI collected in morning
(solid lines), afternoon (dotted lines), and night (dashed lines). Target spectra are
provided in the top panel, and false alarm spectra in the bottom panel.

Also note that the overall shapes of the target and false alarm spectra are quite similar,
and only subtle differences may distinguish targets of interest from non-threatening
anomalies. Furthermore, the shapes of the target and false alarm spectra vary with respect
to time of day. For example, the morning and afternoon spectra for both targets and
false  alarms  exhibit  a  local  peak  around  the  fifth  spectral  band.  Meanwhile,  this  peak 
does  not  appear  in  the  night  signatures.  These  differences  between  spectra  suggest 
that  time  of  day,  and  the  lighting  and  temperature  conditions  associated  with  it, 
are  contextual  factors  that  should  potentially  be  considered  in  target  classification 
processing. 

7.1.3  Buried  Threat  Detection  with  HSI  in  Changing  Conditions 

Anomaly  detection  in  HSI  requires  proper  modeling  of  the  background  in  order  for 
the  spectra  of  interest  to  properly  appear  as  anomalous.  The  most  basic  approach 
is  to  model  the  background  as  Gaussian-distributed  with  parameters  estimated  by 
maximum-likelihood statistics. This is the method employed by the RX detector,
which may use a sliding window to adaptively estimate the background mean and
covariance, and declares outliers as anomalies [124]. A similar approach to mitigating
local variations in the background is to apply a multimodal statistical model, such as
a mixture of Gaussians [128]. Another approach is to apply a transformation to the
data that yields a feature space invariant to background changes, such as adaptive
whitening and dewhitening [129]. These past techniques were shown to be effective in
cases  where  the  background  is  spatially  non-stationary.  However,  parametric  models 
for  high-dimensional  data  are  difficult  to  learn  robustly,  and  incorporate  little  to 
no  prior  knowledge  regarding  sensor  phenomenology.  Context-dependent  learning  is 
a  potential  method  for  exploiting  knowledge  of  sensor  phenomenology  to  improve 
detection  performance  across  varying  environments.  In  the  HSI  literature,  local 
context-based  processing  was  originally  proposed  for  smoothing  out  segmentation 
maps  [62,63].  In  this  chapter,  contextual  information  is  utilized  to  improve  anomaly 
classification  in  HSI  using  the  same  learning  framework  that  was  originally  developed 
for  a  similar  problem  in  GPR. 

7.2  HSI  Data  Set 

The  HSI  data  used  in  this  work  was  collected  with  the  AHI  sensor  as  part  of  a 
Wide  Area  Airborne  Minefield  Detection  (WAAMD)  platform,  and  has  been  used  in 
several past evaluations of context-dependent landmine detection algorithms [65,126,
130-132]. A total of 8 images (corresponding to individual flyovers) were collected
over  a  minefield  in  the  Southwestern  US  at  times  labeled  “morning,”  “afternoon,” 
and  “night”.  The  minefield  contained  both  surface-laid  and  buried  metal  anti-tank 
targets,  as  well  as  many  empty  holes  which  were  also  counted  as  targets.  The  RX 
detector  was  run  on  the  data,  and  a  total  of  4,591  image  chips,  each  consisting  of  a 
15x15x70  data  block,  were  extracted  around  the  detected  anomalies.  A  total  of  755 
chips were labeled as targets ($H_1$), and 3,836 chips were labeled as clutter ($H_0$).

FIGURE  7.4:  Illustration  of  the  context  and  target  feature  extraction  regions  for  a 
typical  HSI  image  chip.  Note  that  the  image  shown  was  averaged  over  all  70  spectral 
bands  for  visualization  purposes. 

The following section proposes methods for extracting contextual and target features
from the HSI chips. Two contextual feature extraction approaches are considered,
as was previously done in [132]. The first is based on the raw background
spectra,  which  is  useful  for  characterizing  different  times  of  day  based  on  magnitude 
differences.  The  second  is  based  on  spectral  unmixing,  and  is  used  to  characterize 
the  constituent  spectra  that  make  up  the  background. 

7.3  Feature  Extraction  from  HSI  Data 

Figure  7.4  illustrates  the  regions  of  a  sample  HSI  chip  where  contextual  and  target 
features  were  extracted.  As  in  alarm-based  processing  of  GPR  data,  contextual 
features  were  extracted  from  the  background  data  proximate  to  the  detected  anomaly. 
Meanwhile,  target  features  were  extracted  from  the  5x5  central  region  by  averaging 
all  of  the  pixels  within  that  region  to  yield  a  70-D  target  feature  vector. 
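
The region-based averaging described above can be sketched as follows. The 5x5 central target region and the 70-D output come from the text; the exact geometry of the background region is only given graphically in Figure 7.4, so the outer ring of width `ring=2` used here is a hypothetical stand-in, as are the function and variable names.

```python
import numpy as np

def chip_features(chip, center=5, ring=2):
    """Average a (15, 15, 70) chip over its target and background regions.

    `ring` (width of the outer background border, in pixels) is an assumed
    value; the thesis defines the background region only in Figure 7.4.
    """
    rows, cols, bands = chip.shape
    r0 = (rows - center) // 2
    c0 = (cols - center) // 2
    # Target feature: mean spectrum of the central center x center block
    target = chip[r0:r0 + center, c0:c0 + center, :]
    target_vec = target.reshape(-1, bands).mean(axis=0)
    # Background feature: mean spectrum of the outer border of the chip
    mask = np.zeros((rows, cols), dtype=bool)
    mask[:ring, :] = mask[-ring:, :] = True
    mask[:, :ring] = mask[:, -ring:] = True
    background_vec = chip[mask].mean(axis=0)
    return target_vec, background_vec

# Toy chip: bright 5x5 center on a dark background
chip = np.zeros((15, 15, 70))
chip[5:10, 5:10, :] = 1.0
target_vec, background_vec = chip_features(chip)
```

Both outputs are 70-D vectors, one per chip, which matches the dimensionality of the target feature vector described in the text.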

The first context learning technique is based on exploiting the magnitude differences
that are associated with times of day. However, it is also important to consider
the case in which all samples were collected at the same time of day, which eliminates the
possibility  of  temporal  contextual  effects.  Therefore,  the  second  technique  is  based 
on  spectral  unmixing  to  learn  the  constituent  spectra  of  the  pure  materials  present 
in  different  parts  of  the  scene.  The  two  contextual  feature  extraction  techniques  are 
described  in  the  following  sections. 

7.3.1  Context  Learning  Based  on  Background  Spectra 

Recall  Figure  7.3,  which  illustrated  how  the  magnitude  of  target  and  false  alarm 
spectra  varies  substantially  with  respect  to  time  of  day.  These  observations  suggest 
that  temporal  context  can  be  inferred  directly  from  the  raw  HSI  data.  Therefore, 
the  first  context  learning  technique  that  was  considered  utilized  the  background 
region  of  the  image  chips  (the  outer  square  in  Figure  7.4)  to  characterize  whether 
the  observation  was  collected  in  the  morning,  afternoon,  or  night. 

Contextual feature extraction was performed by averaging the pixels in the background
region, projecting the 70-D mean to 3-D with PCA, and then normalizing to
zero-mean and unit-variance. Figure 7.5 illustrates a scatterplot of the principal
components of the averaged background data, colored by time of day. The background
data forms three distinct clusters for morning, afternoon, and night.
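
A minimal sketch of this PCA-based feature extraction is given below, assuming the averaged background spectra have already been stacked into one row per chip. The function name and the synthetic spectra (three groups that differ mainly in overall magnitude, loosely mimicking three times of day) are illustrative assumptions, not the thesis data.

```python
import numpy as np

def pca_context_features(background_means, n_components=3):
    """Project averaged background spectra to a few PCA scores and standardize."""
    X = np.asarray(background_means, dtype=float)
    Xc = X - X.mean(axis=0)
    # Principal directions are the right singular vectors of the centered data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[:n_components].T
    # Normalize each component to zero mean and unit variance
    return (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Synthetic stand-in for the averaged background spectra (150 chips, 70 bands)
rng = np.random.default_rng(1)
offsets = np.repeat([1.0, 2.0, 3.0], 50)[:, None]
spectra = offsets + 0.05 * rng.normal(size=(150, 70))
features = pca_context_features(spectra)
```

With magnitude offsets this large relative to the noise, the first principal component is almost perfectly correlated with the group offset, which is the same qualitative behavior as the three clusters in Figure 7.5.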

After  extracting  the  3-D  background-based  context  features,  they  were  provided 
as  input  to  a  statistical  context  model.  In  this  chapter,  three  context  models  are 
compared. The basic supervised Gaussian hypothesis test, which was described in
Section 3.1, serves as a baseline. In addition, two nonparametric context models
were considered: the generative DPGMM, which was originally presented in Section 4.3,
and the discriminative DPGMM-RVM, which was described in Section 5.3.
In  learning  all  three  context  models,  any  hyperparameter  settings  were  set  to  the 
same values used in previous GPR experiments.

FIGURE 7.5: Scatterplot of the 3-D PCA projection of the averaged background
pixels of each image chip, colored by time of day.

7.3.2  Context  Learning  Based  on  Spectral  Unmixing 

Another scenario to consider is one in which the data were collected under similar lighting
and temperature conditions. Although the data set used in this work was collected at different
times  of  day,  temporal  effects  were  mitigated  to  simulate  data  collected  at  the  same 
time  of  day.  This  was  accomplished  by  subtracting  the  means  from  the  background 
and  target  regions  of  all  image  chips  recorded  in  a  single  flyover. 
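
The mean subtraction used to simulate a single time of day might look like the sketch below. It assumes that one mean spectrum is computed per flyover (band by band) and subtracted from every chip-level spectrum recorded in that flyover; whether the thesis used a per-band or a scalar mean is not stated, and the names `subtract_flyover_means` and `flyover_ids` are hypothetical.

```python
import numpy as np

def subtract_flyover_means(spectra, flyover_ids):
    """Remove the per-flyover mean spectrum from each row of `spectra`."""
    X = np.asarray(spectra, dtype=float)
    ids = np.asarray(flyover_ids)
    out = np.empty_like(X)
    for f in np.unique(ids):
        rows = ids == f
        # Assumed: per-band mean over all chips from this flyover
        out[rows] = X[rows] - X[rows].mean(axis=0)
    return out

# Toy data: two flyovers with very different overall magnitudes
rng = np.random.default_rng(2)
flyover_ids = np.repeat([0, 1], 20)
spectra = rng.normal(size=(40, 70)) + np.where(flyover_ids == 0, 5.0, 50.0)[:, None]
corrected = subtract_flyover_means(spectra, flyover_ids)
```

After the correction, the large magnitude gap between the two flyovers is gone and only within-flyover variation remains.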

In the scenario where all observations are viewed under similar lighting and temperature,
potential contextual factors could be obtained through the spectral composition
of the background. This problem has been pursued extensively in the HSI
literature as spectral unmixing, i.e. the expression of image pixels as a finite sum
of known constituent spectra. These constituent spectra are known as endmembers,
and are representative of the pure elements present in the scene. The following subsections
discuss the linear mixing model as well as the technique used for extracting
contextual features.

7.3.3 Linear Spectral Mixing Model

In Chapter 2, a simple phenomenological model was proposed for motivating contextual
features from GPR data. This was the transmission line model, and although
it  is  based  on  very  broad  physical  assumptions  it  proved  effective  in  characterizing 
quantitative  properties  of  the  soil  environment.  In  HSI,  a  simple  phenomenological 
model  that  is  often  used  is  the  linear  mixing  model,  which  is  based  on  the  assumption 
that  each  of  N  pixels  is  a  linear  combination  of  M  endmember  spectra  representing 
the  pure  elements  present  in  the  scene: 


\[ \mathbf{x}_n = \sum_{m=1}^{M} a_{nm} \mathbf{E}_m + \boldsymbol{\epsilon}_n \tag{7.1} \]

\[ \sum_{m=1}^{M} a_{nm} = 1 \tag{7.2} \]

\[ a_{nm} \geq 0 \tag{7.3} \]

In (7.1), $\mathbf{x}_n$ is the nth D-dimensional pixel in the image (n = 1, 2, ..., N), $\mathbf{E}_m$ is the
mth endmember spectrum (m = 1, 2, ..., M) and the mth column of the endmember
matrix $\mathbf{E}$, $a_{nm}$ is the abundance of endmember m in pixel n, and $\boldsymbol{\epsilon}_n$ is a random
error term. The abundances are constrained to be non-negative and to sum to one.
If there is no error, all of the pixels lie within a simplex with M vertices in a D-dimensional space,
where the endmembers correspond to the vertices of the simplex. The abundances
form a simplex as well, but in M-dimensional space. However, since they must sum to
one, the abundances contain redundant information. By projecting the abundances
onto the simplex, they can serve as a feature space of dimensionality M - 1 from
which a context model can be learned.

7.3.4 Endmember Extraction


A common problem in HSI is that the endmembers for a particular scene are often
unknown, so they must be learned from the image data. Several endmember
extraction algorithms have been proposed for unmixing HSI into its constituent
spectra [133-136]. Endmember extraction algorithms often exploit the geometric
interpretation of the linear mixing model. One of the earliest endmember extraction
techniques is N-FINDR [133], which initializes the endmembers with random pixels
and iteratively grows the simplex to include all pixels. However, a weakness of
N-FINDR and similar techniques is that they inherently assume that at least one pure
pixel is present in the image.

More recent approaches treat the endmember extraction task as an optimization
problem [134-136]. For example, the iterative constrained endmembers (ICE)
algorithm [134] optimizes a trade-off between minimizing the residual sum-of-squares
(RSS) between the pixels and the linear mixing model, and minimizing the sum of
squared distances (SSD) between the endmembers. RSS is calculated by

\[ \mathrm{RSS} = \sum_{n=1}^{N} \left( \mathbf{x}_n - \sum_{m=1}^{M} a_{nm}\mathbf{E}_m \right)^{T} \left( \mathbf{x}_n - \sum_{m=1}^{M} a_{nm}\mathbf{E}_m \right), \tag{7.4} \]

and  SSD  is  calculated  by 

\[ \mathrm{SSD} = \sum_{m=1}^{M-1} \sum_{k=m+1}^{M} \left( \mathbf{E}_m - \mathbf{E}_k \right)^{T} \left( \mathbf{E}_m - \mathbf{E}_k \right) = M(M-1)V, \]

where  V  is  the  sum  of  the  variances  (over  each  band)  of  the  endmembers.  The 
objective  function  minimized  by  ICE  in  learning  the  endmembers  is  given  by 

\[ \mathrm{RSS}_{reg} = (1-\mu)\frac{\mathrm{RSS}}{N} + \mu V, \tag{7.5} \]

where $\mu$ is a parameter that sets the trade-off between RSS and SSD (which is proportional
to V). ICE uses an iterative process to minimize $\mathrm{RSS}_{reg}$ with respect to
the endmembers and abundances. Given a single row of the endmember matrix $\mathbf{E}$,
denoted by $\mathbf{e}_d$ for d = 1, 2, ..., D, RSS is minimized (subject to the constraints on
the abundances) using quadratic programming. This step yields an estimate of the
abundances $\mathbf{A} = \{a_{nm}\}$. Then, given $\mathbf{A}$, the endmembers that minimize $\mathrm{RSS}_{reg}$ are
given by


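\[ \mathbf{e}_d = \left[ \mathbf{A}^{T}\mathbf{A} + \frac{N\mu}{(M-1)(1-\mu)} \left( \mathbf{I}_M - \frac{\mathbf{1}\mathbf{1}^{T}}{M} \right) \right]^{-1} \mathbf{A}^{T}\mathbf{x}_d, \tag{7.6} \]

where $\mathbf{x}_d$ is an N x 1 vector consisting of the dth dimension of all pixels.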
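
For readers who want to see where the endmember update in (7.6) comes from, the following derivation sketches one route. It assumes, following the ICE formulation in [134], that the RSS term in (7.5) is normalized by the number of pixels N and that V is built from the sample variance (denominator M - 1). The centering matrix P, the mean endmember, and the shorthand λ are notation introduced here for convenience only.

```latex
% Notation introduced for this sketch: centering matrix P and mean endmember
\mathbf{P} = \mathbf{I}_M - \tfrac{1}{M}\mathbf{1}\mathbf{1}^{T}, \qquad
\bar{\mathbf{E}} = \tfrac{1}{M}\sum_{m=1}^{M}\mathbf{E}_m

% Pairwise distances reduce to distances from the mean, which gives SSD = M(M-1)V
\sum_{m=1}^{M-1}\sum_{k=m+1}^{M}\lVert\mathbf{E}_m-\mathbf{E}_k\rVert^{2}
  = M\sum_{m=1}^{M}\lVert\mathbf{E}_m-\bar{\mathbf{E}}\rVert^{2}
  = M(M-1)V, \qquad
V = \frac{1}{M-1}\sum_{d=1}^{D}\mathbf{e}_d^{T}\mathbf{P}\,\mathbf{e}_d

% The objective (7.5) separates over the bands d = 1, ..., D
\mathrm{RSS}_{reg} = \sum_{d=1}^{D}\left[\frac{1-\mu}{N}\,
  \lVert\mathbf{x}_d-\mathbf{A}\mathbf{e}_d\rVert^{2}
  + \frac{\mu}{M-1}\,\mathbf{e}_d^{T}\mathbf{P}\,\mathbf{e}_d\right]

% Setting the gradient with respect to e_d to zero
-\frac{2(1-\mu)}{N}\,\mathbf{A}^{T}\left(\mathbf{x}_d-\mathbf{A}\mathbf{e}_d\right)
  + \frac{2\mu}{M-1}\,\mathbf{P}\,\mathbf{e}_d = \mathbf{0}
\;\Longrightarrow\;
\left[\mathbf{A}^{T}\mathbf{A}+\lambda\mathbf{P}\right]\mathbf{e}_d
  = \mathbf{A}^{T}\mathbf{x}_d, \qquad
\lambda = \frac{N\mu}{(M-1)(1-\mu)}
```

Under these assumptions the update is a ridge-like least-squares solve in which the penalty acts only on the spread of the endmembers around their mean, not on the mean itself.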

To illustrate the performance of ICE in extracting endmembers from hyperspectral
data, a synthetic example is shown in Figure 7.6. Three-dimensional data were
generated by 1000 draws from a Dirichlet(1,1,1) distribution. Since the data form a
simplex, ICE should find endmembers close to the simplex vertices. However, minimizing
RSS alone ($\mu$ = 0) would yield a simplex large enough to enclose all the
pixels. Furthermore, minimizing SSD ($\mu$ = 1) would yield another degenerate case
in which all endmembers would converge to the mean of the data. Instead, the $\mu$
parameter must be set to balance the desired trade-off between the two. Figure 7.6
shows the result of ICE on the synthetic data using three different values of $\mu$. Note
that as $\mu$ increases, the learned endmembers move towards the mean of the pixel
data.
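
The effect of the trade-off parameter can be reproduced on similar toy data with the short sketch below. It is not a full ICE implementation: only the closed-form endmember step is shown, the abundances are held fixed at their true simulated values instead of being re-estimated by quadratic programming, and the penalty weight assumes the RSS term is normalized by the number of pixels N as in the ICE formulation of [134]. The function names are illustrative.

```python
import numpy as np

def ice_endmember_update(X, A, mu):
    """Closed-form endmember step for fixed abundances.

    X: (N, D) pixels, A: (N, M) abundances. Returns E with shape (D, M),
    whose columns are the endmembers. Assumes RSS is divided by N in the
    regularized objective.
    """
    N, M = A.shape
    lam = N * mu / ((M - 1) * (1.0 - mu))
    P = np.eye(M) - np.ones((M, M)) / M  # centering matrix
    # Solve (A^T A + lam P) e_d = A^T x_d for all D bands at once
    E_rows = np.linalg.solve(A.T @ A + lam * P, A.T @ X)  # (M, D)
    return E_rows.T

def endmember_spread(E):
    """V: sample variance across endmembers, summed over bands."""
    return float(np.var(E, axis=1, ddof=1).sum())

# 1000 draws from Dirichlet(1,1,1); the true endmembers are the simplex vertices
rng = np.random.default_rng(0)
A_true = rng.dirichlet([1.0, 1.0, 1.0], size=1000)
E_true = np.eye(3)
X = A_true @ E_true.T

mus = (1e-5, 1e-2, 1e-1, 1 - 1e-6)
spreads = [endmember_spread(ice_endmember_update(X, A_true, mu)) for mu in mus]
E_small = ice_endmember_update(X, A_true, 1e-5)
E_collapsed = ice_endmember_update(X, A_true, 1 - 1e-6)
```

With a very small value of the parameter the recovered endmembers sit essentially on the simplex vertices, the spread V shrinks as the parameter grows, and as it approaches one all endmembers collapse onto the mean of the pixel data, which is the degenerate case described above.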

FIGURE 7.6: Results of endmember extraction using ICE on 3-dimensional toy data
with $\mu$ = 0.1 (black), $\mu$ = 0.01 (green), and $\mu$ = 0.00001 (red). The pixel data are
represented by the blue points.

Another illustration of the performance of ICE is shown in Figure 7.7. The top
plot illustrates spectra of three materials from the US Geological Survey spectral
library [137]. A total of 1000 random mixtures of these materials were simulated by
drawing abundances from a Dirichlet(1,1,1) distribution. ICE was run on the mixed
data with M = 3 and $\mu$ = 0.001. The 3 endmembers extracted by ICE are shown in the
bottom plot, and match the constituent spectra above very closely. Endmember 1 is
approximately equal to the spectrum of rabbitbrush, Endmember 2 is approximately
equal to the spectrum of juniper, and Endmember 3 is approximately equal to the
spectrum of grass.

FIGURE 7.7: Results of endmember extraction using ICE on a synthetic mixture
of endmember spectra from the USGS spectral library. The top plot illustrates the
reflectance of three materials measured over many wavelengths by spectroradiometer.
The bottom plot illustrates the endmembers extracted from N = 1000 random
mixtures of the above spectra by ICE.

In this work, ICE was used as a technique for contextual feature extraction for
HSI. To eliminate temporal differences in spectral magnitude, the means of the background
and target pixels for each time of day were subtracted from the images. Then,
for each anomaly detected by RX, the pixels in the background region of each 15 x 15
chip were averaged. ICE was run with M = 4 and $\mu$ = 0.001 on the aggregation of
the averaged background spectra for all detected anomalies. A larger M could potentially
be used, but experiments with larger values of M resulted in endmembers
that were redundant or had negligible abundance. It should also be noted that a
sparseness-promoting modification of ICE (SPICE) has been proposed for learning
the number of endmembers [135]. However, the number of endmembers was fixed for
the sake of comparison with the 3-D PCA-based features discussed in Section 7.3.1.
The learned endmember spectra are shown in Figure 7.8, and they are clearly distinct
from one another.

After extracting the four endmembers and calculating their abundances for each
chip, the abundances were projected onto the simplex to yield 3-D contextual features,
which were then normalized to zero-mean and unit variance. It is expected that
different contexts should be characterized by differences in endmember abundances.
Therefore, the contextual features should be amenable to clustering by a statistical
mixture model. As with the background-based features proposed in the previous section,
context learning was performed using the supervised model, the generative DPGMM,
and the discriminative DPGMM-RVM.
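
One way to carry out the projection from four abundances to three contextual features is sketched below. The text does not spell out the exact projection, so this example makes an assumption: it expresses the abundances in an orthonormal basis of the subspace orthogonal to the all-ones vector, which removes the redundant sum-to-one direction without distorting distances. The function names and the simulated abundances are illustrative.

```python
import numpy as np

def simplex_coordinates(abundances):
    """Map (N, M) abundances that sum to one to (N, M-1) coordinates."""
    A = np.asarray(abundances, dtype=float)
    M = A.shape[1]
    ones = np.full((M, 1), 1.0 / np.sqrt(M))
    # QR of [ones | I]: the first column of q spans the all-ones direction,
    # the remaining M-1 columns form an orthonormal basis of its complement
    q, _ = np.linalg.qr(np.hstack([ones, np.eye(M)]))
    basis = q[:, 1:]
    return (A - 1.0 / M) @ basis

def standardize(features):
    """Normalize each feature to zero mean and unit variance."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

# Simulated abundances for M = 4 endmembers
rng = np.random.default_rng(0)
abund = rng.dirichlet(np.ones(4), size=200)
coords = simplex_coordinates(abund)
context_features = standardize(coords)
```

Because differences between abundance vectors always sum to zero, they lie entirely in the retained subspace, so pairwise distances between chips are the same before and after the projection.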

7.4  Experimental  Results 

FIGURE 7.8: Endmember spectra extracted from background regions of AHI image
chips using ICE with $\mu$ = 0.001.

Experimental results are presented in this section, illustrating the results of context
learning, context-dependent band selection, and overall detection performance on
the HSI data described in Section 7.2. In each of the following subsections, results
are compared for context learning using background-based and endmember-based
features. By considering both sets of features separately, the relevant differences
between exploiting temporal and environmental contexts can be seen. Detection
performance is also compared to a single linear RVM, which incorporates no contextual
information, as well as to methods that attempt to mitigate contextual effects via
whitening/dewhitening [129] and a mixture of Gaussians [128], and the RX detector
which was used as a prescreener [124].

7.4.1 Context Learning Results

As discussed in Section 7.3.1, the background spectra should be indicative of different
times of day. Scatterplots of the PCA-projected background spectra are shown
in Figure 7.9, with points colored according to their maximum a posteriori (MAP)
temporal label (top-left), DPGMM-learned context (top-right), and discriminatively-learned
context (bottom). The top-left plot shows that supervised context learning
successfully classifies points according to the time-of-day labels originally shown in
Figure 7.5. The top-right plot illustrates how the DPGMM splits the three temporal
categories into sub-contexts that may be reflective of more subtle differences in
spectrum magnitude. Finally, the bottom plot illustrates that discriminative context
learning partitions the feature space in a different manner than generative context
learning.

FIGURE 7.9: Scatterplots illustrating results of supervised (top-left), generative
DPGMM (top-right), and discriminative (bottom) context learning from the
PCA-projected background spectra. Points are colored according to their MAP context.

The similarity of the three context learning methods is summarized in Table 7.1,
which lists the pairwise adjusted mutual information (AMI) [110] of the MAP
context assignments. Recall that an AMI of one corresponds to identical cluster
assignments, and an AMI of zero corresponds to a mutual information expected by
chance. The DPGMM and discriminative context models are most similar, having
an AMI of 0.8669. The supervised and discriminative models are most different,
with an AMI of 0.6947. However, the tabulated AMI values are all relatively high,
suggesting a great degree of similarity between the three models despite learning
different numbers of contexts.

Table 7.1: AMI of HSI Context Models Trained on PCA of Background Spectra

                  Supervised    DPGMM     Discriminative
Supervised        1             0.7594    0.6947
DPGMM             0.7594        1         0.8669
Discriminative    0.6947        0.8669    1
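
To make the AMI comparison concrete, the sketch below computes the adjusted mutual information between two label vectors from scratch. It is only an illustration: the expected mutual information uses the standard hypergeometric model of random clusterings, and the normalization by the larger of the two cluster entropies is an assumption, since the text does not state which AMI variant of [110] was used. The function names and the toy label vectors are hypothetical, and the function is undefined when both clusterings contain a single cluster.

```python
import numpy as np
from math import exp, lgamma, log

def _entropy(counts, n):
    """Shannon entropy (in nats) of a clustering given its cluster sizes."""
    return -sum(c / n * log(c / n) for c in counts if c > 0)

def adjusted_mutual_info(labels_u, labels_v):
    """AMI of two labelings, normalized by the larger entropy (assumed variant)."""
    u = np.unique(labels_u, return_inverse=True)[1].ravel()
    v = np.unique(labels_v, return_inverse=True)[1].ravel()
    n = len(u)
    # Contingency table of co-occurring labels
    table = np.zeros((u.max() + 1, v.max() + 1), dtype=int)
    np.add.at(table, (u, v), 1)
    a = table.sum(axis=1).tolist()
    b = table.sum(axis=0).tolist()

    # Observed mutual information
    mi = 0.0
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            nij = int(table[i, j])
            if nij > 0:
                mi += nij / n * log(n * nij / (ai * bj))

    # Expected mutual information under the hypergeometric null model
    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                log_prob = (lgamma(ai + 1) + lgamma(bj + 1)
                            + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                            - lgamma(n + 1) - lgamma(nij + 1)
                            - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                            - lgamma(n - ai - bj + nij + 1))
                emi += nij / n * log(n * nij / (ai * bj)) * exp(log_prob)

    return (mi - emi) / (max(_entropy(a, n), _entropy(b, n)) - emi)

# Same partition under different label names, and two unrelated partitions
ami_same = adjusted_mutual_info([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
ami_indep = adjusted_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair describes the same partition with different label names and therefore scores one. The second pair shares no information, and its AMI falls below zero because the observed mutual information is smaller than the value expected by chance.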

It was suggested in Section 7.3.2 that if time-of-day effects are corrected for
(by subtracting the mean from the background and target regions), spectral unmixing
may characterize the variations in endmember abundance throughout the
scene. Endmember-based context learning was performed on the projection of the
abundances  onto  the  3-simplex  (a  tetrahedron).  Unlike  the  averaged  background 
spectra,  the  endmember  abundances  were  not  expected  to  characterize  time  of  day. 
Instead,  they  were  expected  to  characterize  local  populations  of  observations  where 
the  endmember  spectra  mix  differently  in  the  background. 

Figure 7.10 illustrates scatterplots of the endmember features, with points colored
according to their MAP time of day (top-left), DPGMM-learned context (top-right),
and discriminatively-learned context (bottom). The greatest difference between these
scatterplots and those in Figure 7.9 is that the features do not cluster according to
time of day. This comes as no surprise, since the HSI chips were mean-subtracted
to eliminate the magnitude differences caused by temporal changes in lighting and
temperature. However, the top-right and bottom plots show that the nonparametric
models allow for contexts to be learned in an unsupervised manner. Generative
context learning with the DPGMM found 11 contexts, and discriminative context
learning with the DPGMM-RVM found 17 contexts. In both cases, more contexts
were learned from the endmember features than from the averaged background spectra,
suggesting that the endmember features may be indicative of more localized
contextual factors.

FIGURE 7.10: Scatterplots illustrating results of supervised (top-left), generative
DPGMM (top-right), and discriminative (bottom) context learning from the
endmember abundances learned by ICE. Points are colored according to their MAP
context.

Table 7.2: AMI of HSI Context Models Trained on Endmember Abundances

                  Supervised    DPGMM     Discriminative
Supervised        1             0.3402    0.3190
DPGMM             0.3402        1         0.6214
Discriminative    0.3190        0.6214    1

The AMI values of the contexts learned from the endmember abundances are summarized
in Table 7.2. The DPGMM and discriminative context models were the most
mutually informative, with an AMI of 0.6214. However, they were less mutually
informative than they were when trained on the background spectra, suggesting that the
models behave more differently when trained on endmember abundances. The two
nonparametric context models had an AMI with the supervised model of about 0.3,
which is much lower than when the models were trained on the background spectra.
Low AMI between the supervised and nonparametric models was expected because
temporal effects were eliminated.

7.4.2  Context-Dependent  Band  Selection  Results 

As  was  done  previously  for  context-dependent  fusion  in  GPR,  the  linear  RVM  was 
used  as  a  context-specific  classifier  for  discriminating  targets  from  false  alarms  in  HSI 
data.  Target  features  were  extracted  from  each  image  chip  by  averaging  the  pixels 
in the center region. Separate RVMs were learned for classifying the averaged target
spectra  in  each  context.  Because  the  priors  used  in  learning  promote  sparseness  in 
the  weights,  the  RVMs  will  apply  nonzero  weight  to  only  a  subset  of  the  70  spectral 
bands.  Therefore,  training  an  ensemble  of  RVMs  based  on  the  learned  contexts  will 
also  perform  band  selection  within  each  context. 
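
The link between sparse weights and band selection can be illustrated with a small sketch. It assumes the per-context RVM weight vectors have already been learned and stacked into a matrix with one row per context; the weight values, the tolerance `tol`, and the function name `selected_bands` are hypothetical and are not taken from the experiments.

```python
import numpy as np

def selected_bands(weights, tol=1e-8):
    """Return, for each context, the indices of bands with nonzero weight.

    weights: (n_contexts, n_bands) array of linear classifier weights.
    """
    W = np.asarray(weights, dtype=float)
    return [np.flatnonzero(np.abs(w) > tol) for w in W]

# Hypothetical sparse weights for three contexts over 70 spectral bands
W = np.zeros((3, 70))
W[0, [3, 5]] = [0.8, -0.4]
W[1, [4, 5, 6]] = [0.5, 0.2, -0.7]
W[2, [10]] = [1.1]
bands = selected_bands(W)
```

Each context ends up with its own short list of relevant bands, which is the sense in which an ensemble of sparse classifiers performs context-dependent band selection.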

The  RVM  weights  corresponding  to  each  context  learned  from  the  background 
spectra  are  shown  in  Figure  7.11.  The  top  plot  illustrates  the  weights  for  each  of  the 
three times of day. Most of the bands receiving nonzero weight are towards the left,
which roughly corresponds to the “bump” in the spectra shown in Figure 7.3. In each
context, a unique subset of the spectral bands receives nonzero discriminant weight,
suggesting that the relevance of certain bands for discriminating targets from clutter varies
with time of day. The center plot illustrates the discriminant weights assigned for
each  of  the  generatively-learned  DPGMM  contexts,  which  are  less  sparse  than  those 
assigned  for  the  temporally-labeled  contexts.  The  bottom  plot  illustrates  the  weights 
obtained  for  each  of  the  discriminatively-learned  contexts,  and  they  appear  similar 
to  those  assigned  for  the  temporally-labeled  contexts.  Note  that  the  weights  for  the 
discriminatively-learned  contexts  appear  to  be  more  sparse  than  those  learned  for 
the  generatively-learned  contexts. 

Figure  7.12  illustrates  the  RVM  weights  corresponding  to  each  context  learned 
from  the  endmember  abundances.  The  top  plot  shows  the  weights  for  each  of  the 
three  temporally-labeled  contexts.  Note  that  the  weights  in  this  panel  are  different 
from  those  in  the  top  panel  of  Figure  7.11.  This  difference  suggests  that  subtracting
the  mean  from  the  target  features,  which  eliminates  temporal  effects,  also  changes 
the  relevance  of  certain  spectral  bands  for  classification  purposes.  The  center  plot 
illustrates  the  RVM  weights  assigned  to  each  of  the  generatively-learned  DPGMM 
contexts,  and  they  appear  to  be  more  sparse  than  those  shown  in  the  center  panel 
of  Figure  7.11.  These  weights  also  appear  similar  to  those  in  the  bottom  panel, 
which  correspond  to  the  discriminatively-learned  contexts.  Note  that  in  the  case 
of  endmember-based  context  learning,  a  greater  number  of  contexts  were  learned. 
The  RVMs  learned  for  each  context  tend  to  be  very  sparse,  and  some  may  only 
assign  nonzero  weight  to  two  or  three  spectral  bands.  These  results  suggest  that 
the  reststrahlen  characteristics  of  disturbed  earth  may  manifest  themselves  in  only  a  few
spectral  bands  that  depend  on  the  local  soil  context. 

Figure  7.11:  RVM  discriminant  weights  corresponding  to  supervised  (top),
DPGMM  (center),  and  discriminative  (bottom)  contexts  learned  from  background
spectra.  The  horizontal  axes  represent  spectral  band,  and  the  vertical  axes  represent
the  value  of  the  discriminant  weights.  Context  is  indicated  by  color.

Figure  7.12:  RVM  discriminant  weights  corresponding  to  supervised  (top),
DPGMM  (center),  and  discriminative  (bottom)  contexts  learned  from  endmember
abundances.  The  horizontal  axes  represent  spectral  band,  and  the  vertical  axes  represent
the  value  of  the  discriminant  weights.  Context  is  indicated  by  color.


7.4.3  Detection  Performance


The  discrimination  performance  of  context-dependent  learning  was  evaluated  via  10-fold
cross-validation  over  the  image  chips.  The  results  of  context-dependent  classification
using  the  background  features  are  summarized  by  the  ROC  curves  shown
in  Figure  7.13.  The  three  black  lines  correspond  to  detectors  from  the  literature
that  attempt  to  mitigate  contextual  effects.  The  black  solid  line  illustrates  the  performance
of  the  RX  prescreener  [124],  the  black  dashed  line  illustrates  the  performance
of  whitening/dewhitening  [129],  and  the  black  dotted  line  illustrates  the  performance
of  the  mixture  of  Gaussians  technique  [128].  The  colored  lines  indicate  the
performance  of  trained  classifiers.  The  blue  line  corresponds  to  the  performance  of  a 
single  linear  RVM  that  incorporates  no  contextual  information.  Context-dependent 
classification  with  the  generative  supervised  context  model  is  shown  in  magenta, 
and  context-dependent  classification  with  the  generative  DPGMM  context  model  is 
shown  in  red.  The  performance  of  discriminative  context-dependent  learning  with 
the  DPGMM-RVM  model  is  shown  in  green. 

The  order  of  performance  is  very  similar  to  the  GPR  results  that  were  shown  in 
previous  chapters.  Generative  context  learning  with  the  DPGMM  yields  the  most 
performance  improvement.  Even  at  high  PD,  the  ROC  curve  for  context-dependent 
classification  based  on  the  DPGMM  shows  the  most  reduction  in  PF.  Furthermore, 
all  three  context-dependent  classification  techniques  yielded  better  performance  than 
the  global  RVM,  but  the  degree  of  improvement  does  not  appear  to  be  substantial. 
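
The ROC comparisons in this section rest on sweeping a threshold over classifier confidences and recording PD against PF. The following is a minimal sketch of that computation, assuming higher scores indicate targets and labels of 1 (target) or 0 (false alarm). The function names `roc_curve` and `pf_at_pd` and the toy scores are illustrative assumptions, not the evaluation code used in this work.

```python
def roc_curve(scores, labels):
    """Return (p_fa, p_d) lists obtained by sweeping a threshold over scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one target and one false alarm")
    # Visit observations from highest to lowest confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    p_fa, p_d = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last observation in a run of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            p_fa.append(fp / n_neg)
            p_d.append(tp / n_pos)
    return p_fa, p_d


def pf_at_pd(p_fa, p_d, target_pd):
    """Smallest PF at which the curve reaches at least target_pd."""
    return min(f for f, d in zip(p_fa, p_d) if d >= target_pd)


# Hypothetical confidences for six alarms (three targets, three false alarms).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
p_fa, p_d = roc_curve(scores, labels)
pf_high_pd = pf_at_pd(p_fa, p_d, 0.85)
```

Reading PF at a fixed high PD, as `pf_at_pd` does, is one way to make the "reduction in PF at high PD" comparison concrete.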

The  results  of  context-dependent  learning  based  on  the  endmember  features  are 
summarized  by  the  ROC  curves  in  Figure  7.14.  Note  that  in  this  case,  subtracting 
the  mean  of  each  flyover’s  target  features  caused  the  global  RVM  to  perform  worse 
than  in  Figure  7.13.  However,  the  nonparametric  methods  yielded  much  greater 
improvement  over  the  single  RVM  than  they  did  when  using  background  features. 

Figure  7.13:  ROC  curves  for  context-dependent  classification  of  HSI  data  using
background  context  features.  Performance  of  the  RX  prescreener  (black  solid),
whitening/dewhitening  (black  dashed),  mixture  of  Gaussians  (black  dotted),  global
RVM  (blue),  generative  context-dependent  learning  with  supervised  (magenta)
and  DPGMM  (red)  context  models,  and  discriminative  context-dependent  learning
(green)  are  compared.  The  horizontal  axis  represents  probability  of  false  alarm  (PF)
and  the  vertical  axis  represents  probability  of  detection  (PD).


Furthermore,  the  generative  approach  with  the  DPGMM  context  model  did  not  yield 
the  greatest  performance  improvement  at  high  PD.  For  PDs  greater  than  approximately
0.85,  discriminative  context  learning  shows  the  most  substantial  reduction
in  PF.  These  results  suggest  that  if  all  HSI  observations  were  collected  under  similar
lighting  and  temperature  conditions,  endmember-based  context  learning  has  the
potential  to  substantially  improve  classification  performance.

Figure  7.14:  ROC  curves  for  context-dependent  classification  of  HSI  data  using
endmember  context  features.  Performance  of  the  RX  prescreener  (black  solid),
whitening/dewhitening  (black  dashed),  mixture  of  Gaussians  (black  dotted),  global
RVM  (blue),  generative  context-dependent  learning  with  supervised  (magenta)
and  DPGMM  (red)  context  models,  and  discriminative  context-dependent  learning
(green)  are  compared.  The  horizontal  axis  represents  probability  of  false  alarm  (PF)
and  the  vertical  axis  represents  probability  of  detection  (PD).


7.5  Conclusions 

This  chapter  presented  another  application  of  context-dependent  learning  for  buried 
threat  detection  using  a  sensing  modality  complementary  to  GPR.  Airborne  HSI  is  a 
useful  sensing  technology  for  wide-area  assessment  whose  phenomenology  exploits  the 
reflectance  properties  of  disturbed  earth  known  as  the  reststrahlen  effect.  Therefore, 
HSI  sensors  such  as  AHI  that  are  tuned  to  special  reststrahlen  bands  for  disturbed 
earth  may  be  useful  in  buried  threat  detection  applications. 

This  chapter  compared  two  approaches  to  context-dependent  classification  of  HSI
chips  centered  on  anomalies  detected  by  the  RX  algorithm,  which  served  as  a  prescreener.
In  both  approaches,  contextual  features  were  extracted  from  the  background
data,  which  consisted  of  the  pixels  outside  of  the  5  x  5  center  region  of  each
15  x  15  chip.  The  first  set  of  context  features  was  motivated  by  the  differences  in  the
magnitude  of  background  spectra  at  different  times  of  day.  It  was  shown  that  spectra
collected  in  the  afternoon  had  higher  magnitude  than  those  collected  in  the  morning  and
at  night,  and  spectra  collected  in  the  morning  and  at  night  exhibited  magnitude  differences
as  well.  Therefore,  context  features  were  extracted  by  averaging  the  pixels  in  the
background  region  of  each  image  chip,  and  context  learning  was  performed  in  either
a  supervised  manner  using  qualitative  time-of-day  labels,  or  through  nonparametric
models  such  as  the  DPGMM  and  the  discriminative  DPGMM-RVM.  Results  of  context
learning  showed  that  the  different  times  of  day  were  easily  characterized,  and
all  three  context  learning  techniques  had  high  mutual  information.

The  second  context  learning  approach  considered  the  case  where  there  were  no 
temporal  effects.  For  the  purposes  of  simulating  this  case,  the  means  of  the  background
and  target  data  were  subtracted  for  each  flyover.  Features  were  extracted
from  the  background  data  via  spectral  unmixing.  The  ICE  algorithm  was  used  to
learn  four  endmember  spectra  from  all  chips’  averaged  background  spectra.  The
endmember  abundances  were  projected  onto  the  corresponding  simplex,  and  context
learning  was  performed  on  the  resulting  3-D  features  using  supervised  and  nonparametric
methods.  Context  learning  results  illustrated  that  although  effects  of  temporal
context  were  eliminated,  nonparametric  context  learning  found  many  distinct
clusters  in  the  endmember  features. 

Experimental  results  compared  the  performance  of  context-dependent  classification
using  the  background  and  endmember-based  context  features.  Results  from
using  background  features,  which  were  expected  to  exploit  temporal  differences  in
spectral  signatures,  illustrated  that  context-dependent  learning  did  not  yield  much
improvement  over  the  single  RVM.  However,  in  the  case  where  temporal  effects  were
removed  and  endmember  features  were  used,  context-dependent  learning  yielded  substantial
improvements  over  the  RVM,  and  discriminative  context-dependent  learning
showed  the  best  performance  at  high  PD.  These  results  illustrate  that  despite  using
similar  context-dependent  learning  approaches,  the  degree  of  improvement  over  conventional
classification  can  be  highly  dependent  on  the  contextual  information  being
exploited. 




8 


Conclusions  and  Future  Work 


In  this  dissertation,  a  variety  of  nonparametric  Bayesian  methods  for  context  learning 
were  proposed  for  improving  the  robustness  of  sensor  systems  used  to  detect  buried 
explosive  threats  such  as  landmines  or  IEDs.  However,  the  novel  contributions  of  this 
work  have  broader  application  to  a  variety  of  current  research  areas.  The  following 
subsections  summarize  these  contributions,  propose  avenues  for  future  work,  and 
discuss  the  broader  implications  of  the  novel  context-dependent  models  that  were 
developed. 

8.1  Summary  of  Contributions 

In  Chapter  1,  the  threat  of  buried  explosives  was  introduced  as  a  problem  of  major 
concern  to  militaries  and  humanitarian  organizations.  GPR  was  then  introduced  as 
a  valuable  tool  in  detecting  landmines  and  IEDs,  since  its  phenomenology  enables 
the  detection  of  nonmetal  objects.  However,  the  phenomenology  of  GPR  also  makes 
it  sensitive  to  effects  from  many  aspects  of  the  subsurface  environment.  Particular 
attention  was  paid  to  the  effects  of  soil  moisture  [17-20],  rough  surface  scattering
[23-28],  and  subsurface  heterogeneity  [22,28].  Although  techniques  based  on
electromagnetic  model  inversion  have  been  proposed  for  inferring  an  object’s  true
size  and  shape  based  on  noisy  GPR  responses  [29-31],  these  types  of  approaches 
are  computationally  slow  and  require  a  priori  measurements  of  a  target’s  scattering 
properties.  Because  military  route  clearance  requires  real-time  processing,  and  the 
IED  threat  is  constantly  redefining  itself,  iterative  model  inversion  may  not  be  the 
best  approach  to  improving  detection  across  varying  environments. 

In  this  work,  a  Bayesian  learning  framework  referred  to  as  context-dependent  classification
was  proposed  as  a  technique  for  maintaining  robust  performance  across
varying  environments.  Traditionally,  statistical  classification  would  be  performed  on
a  set  of  target  features  designed  for  distinguishing  target  signatures  from  clutter.
However,  in  varying  environments  there  can  be  significant  class  overlap  in  the  target
feature  space  that  cannot  be  modeled  by  a  linear  decision  boundary.  In  context-dependent
classification,  a  set  of  secondary  context  features  was  proposed  for  clustering
observations  collected  under  similar  environmental  conditions.  By  conditioning
the  classifiers  operating  in  target-space  on  the  clusters  learned  in  context-space,
a  complex  nonlinear  classification  problem  could  potentially  be  broken  down  into 
several  simpler  linear  ones  that  are  motivated  by  changes  in  the  ambient  sensing 
environment. 
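
The way a context model conditions the target-space classifiers can be sketched as a posterior-weighted mixture of per-context linear classifiers. The following minimal example is an illustration under simplifying assumptions that are not taken from this work: contexts are isotropic Gaussians with known parameters, each classifier is a logistic function of a linear activation, and all names (`context_posterior`, `context_dependent_confidence`) and toy values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Isotropic Gaussian density evaluated at the vector x."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-0.5 * sq_dist / var) / (2.0 * math.pi * var) ** (len(x) / 2.0)


def context_posterior(x_context, contexts):
    """P(context m | context features) for a list of (prior, mean, var) tuples."""
    joint = [p * gaussian_pdf(x_context, mean, var) for p, mean, var in contexts]
    total = sum(joint)
    return [j / total for j in joint]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def context_dependent_confidence(x_target, x_context, contexts, classifiers):
    """Mix the per-context linear classifiers using the context posterior."""
    posterior = context_posterior(x_context, contexts)
    confidence = 0.0
    for p_m, (w, b) in zip(posterior, classifiers):
        activation = sum(wi * xi for wi, xi in zip(w, x_target)) + b
        confidence += p_m * sigmoid(activation)
    return confidence


# Two hypothetical contexts whose classifiers weight the target feature oppositely.
contexts = [(0.5, [0.0], 1.0), (0.5, [10.0], 1.0)]
classifiers = [([1.0], 0.0), ([-1.0], 0.0)]
conf_a = context_dependent_confidence([2.0], [0.0], contexts, classifiers)
conf_b = context_dependent_confidence([2.0], [10.0], contexts, classifiers)
```

The same target feature value yields a high confidence in one context and a low confidence in the other, which a single linear decision boundary in target space could not reproduce.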

Although  other  researchers  have  proposed  context-based  learning  techniques  in 
the  literature  [64,65],  the  definition  of  context  used  in  this  work  differs  from  those 
used  in  the  past.  In  this  work,  context  is  motivated  by  the  physical  state  of  the  world 
from  which  an  observation  was  drawn,  and  not  from  the  properties  of  the  observation 
itself.  Regardless  of  whether  a  target  is  present  at  a  particular  location,  the  context  of 
that  location  is  still  the  same.  It  was  therefore  proposed  that  contextual  information 
can  be  extracted  from  the  background  sensor  data  by  exploiting  a  priori  knowledge 
about  the  underlying  phenomenology. 

In  Chapter  2,  several  physically-motivated  contextual  features  were  proposed  for 
training  a  statistical  context  model.  The  features  were  based  upon  a  transmission
line  model  for  GPR  A-scans.  Although  using  the  transmission  line  model  implies  major
simplifying  assumptions  about  the  physics  of  wave  propagation,  deviations  of  the
model  from  reality  could  be  accounted  for  by  analysis  of  the  features’  statistics.  A  variety
of  features  were  proposed  to  characterize  different  soil  properties.  For  example,
an  energy  feature  was  proposed  for  characterizing  properties  such  as  soil  permittivity,  conductivity,
and  heterogeneity.  A  feature  based  on  the  reflection  coefficient  was  proposed  for
characterizing  the  dielectric  contrast  at  the  air/ground  interface.  To  compute  the 
reflection  coefficient,  the  ground  bounce  was  isolated  and  basic  radar  ranging  was 
applied.  To  characterize  soil  heterogeneity,  features  based  on  the  matching  pursuits 
algorithm  were  proposed  for  estimating  the  number  of  unique  reflections  that  make 
up  a  single  A-scan.  Finally,  features  based  on  linear  prediction  were  proposed  for 
characterizing  the  stochastic  properties  of  the  background. 
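
As an illustration of the matching pursuits feature, the sketch below greedily decomposes a signal over a dictionary of shifted pulses and counts the atoms needed to explain most of its energy. It is a minimal sketch under stated assumptions: the differentiated-Gaussian atom shape, the pulse width, the spacing of pulse locations, and the residual-energy stopping rule (`energy_fraction`) are illustrative choices, not the settings used in this work.

```python
import math


def diff_gaussian_atom(length, center, width):
    """Unit-norm differentiated-Gaussian pulse centered at sample `center`."""
    atom = [-(n - center) * math.exp(-0.5 * ((n - center) / width) ** 2)
            for n in range(length)]
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]


def matching_pursuit(signal, dictionary, energy_fraction=0.05, max_atoms=20):
    """Greedy matching pursuit; returns a list of (atom_index, coefficient)."""
    residual = list(signal)
    total_energy = sum(s * s for s in signal)
    chosen = []
    while len(chosen) < max_atoms:
        # Stop once the residual holds only a small fraction of the energy.
        if sum(r * r for r in residual) <= energy_fraction * total_energy:
            break
        # Correlate the residual with every (unit-norm) atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(corr[k]))
        coeff = corr[best]
        if coeff == 0.0:
            break  # the dictionary cannot explain the remaining residual
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
        chosen.append((best, coeff))
    return chosen


# Toy A-scan made of two well-separated reflections (hypothetical values).
length, width = 64, 2.0
centers = list(range(4, 64, 4))
dictionary = [diff_gaussian_atom(length, c, width) for c in centers]
signal = [2.0 * a - 1.0 * b for a, b in zip(dictionary[3], dictionary[10])]
atoms = matching_pursuit(signal, dictionary)
num_reflections = len(atoms)
```

Restricting `centers` to a coarse grid of pulse locations keeps the dictionary small, which directly reduces the cost of the correlation step in each iteration.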

To  evaluate  the  performance  of  these  features  in  characterizing  quantitative  soil 
properties,  experiments  were  performed  using  simulated  and  field-collected  GPR 
data.  For  these  experiments,  the  features  were  extracted  from  GPR  data  free  of 
landmine  signatures,  and  statistical  regression  and  classification  models  were  used  to 
predict  known  soil  properties  from  the  features.  In  Section  2.3,  it  was  shown  that  the 
features  were  informative  in  predicting  soil  dielectric  constant,  conductivity,  surface 
correlation  length  (roughness),  and  the  expected  number  of  subsurface  scatterers 
(heterogeneity)  from  simulated  GPR  data.  The  results  of  the  experiment  on  field- 
collected  data  were  presented  in  Section  2.4.  In  this  experiment,  the  features  were 
shown  to  be  informative  in  estimating  measurements  of  soil  moisture  and  temperature 
collected  from  a  nearby  meteorological  station.  These  results  represent  the  first 
successful  application  of  statistical  inference  for  identifying  soil  properties  from  GPR 
features  that  are  easy  to  extract  in  real-time  operation. 

In  context-dependent  classification,  the  contextual  features  were  used  to  train  a 
statistical  context  model  to  partition  the  training  data  into  M  clusters  known  as  contexts.
Because  the  contextual  features  were  shown  to  be  characteristic  of  quantitative
soil  properties,  performing  clustering  in  that  space  can  group  together  observations 
that  were  collected  in  similar  environments.  After  learning  the  individual  contexts, 
a  unique  classifier  (the  RVM  [83]  in  this  work)  can  be  trained  on  the  target  features 
for  discriminating  targets  from  clutter  in  that  context.  In  this  work,  the  RVMs  were 
trained  on  the  confidence  values  of  four  currently-fielded  detection  algorithms  -  a 
process  referred  to  as  context-dependent  fusion. 

In  Chapter  3,  two  basic  context  models  were  presented.  The  first  was  a  supervised 
context  model  based  on  a  Gaussian  hypothesis  test  between  known  qualitative  soil 
labels:  dirt,  gravel,  asphalt,  and  concrete.  By  projecting  the  contextual  features 
to  3-D  via  PGA,  and  learning  a  Gaussian  distribution  for  each  labeled  soil  type, 
test  observations  were  classified  according  to  the  most  likely  soil  type.  Although 
this  approach  yielded  excellent  performance  in  distinguishing  the  four  different  soils, 
supervised  context  learning  is  highly  dependent  on  the  quality  of  the  labels.  Therefore,
another  basic  context  model  was  proposed  based  on  unsupervised  learning,  which 
is  performed  without  labels.  Basic  unsupervised  context  learning  was  performed 
by  estimating  the  parameters  of  an  M-order  GMM  from  the  contextual  features. 
Although  this  approach  was  able  to  sub-divide  each  of  the  four  soils  into  multiple 
sub-clusters  that  could  provide  more  physically-meaningful  contextual  information, 
choosing  the  order  of  the  model  is  a  separate  and  difficult  task.  As  a  result,  it 
was  concluded  that  context-dependent  classification  could  potentially  benefit  from 
context  models  that  facilitate  learning  the  number  of  contexts,  in  addition  to  their 
statistical  distribution  in  feature  space. 

Several  Bayesian  approaches  for  learning  nonparametric  context  models  were  proposed
in  Chapters  4,  5,  and  6.  The  common  factor  among  the  proposed  nonparametric
context  models  was  that  they  were  all  infinite-order  probabilistic  mixtures
that  incorporated  Dirichlet  process  (DP)  priors  [103].  The  DP  prior  was  used  to
control  model  complexity  and  facilitate  learning  of  an  effective  model  order.  This
property  was  illustrated  through  discussion  of  the  Chinese  restaurant  process  and
stick-breaking  process  in  Section  4.2.  Because  posterior  inference  cannot  be  performed
analytically  for  nonparametric  mixture  models,  variational  Bayesian  (VB)
inference  was  used  to  perform  approximate  inference.  An  overview  of  VB  was  given 
in  Section  E.10. 
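
The role of the DP prior in controlling model order can be illustrated with the stick-breaking construction, in which each mixture weight is a Beta(1, alpha) fraction of the stick remaining after the previous breaks. The sketch below is a minimal illustration: the truncation level, the concentration value, and the function name `stick_breaking_weights` are assumptions for the example and do not reflect the settings used in this work.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component takes the leftover stick
    return weights


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=20, rng=rng)
# Components carrying non-negligible mass give a rough "effective" model order.
effective_order = sum(1 for w in weights if w > 0.01)
```

Small values of `alpha` place most of the mass on the first few components, while larger values spread it over more components, which is how the prior penalizes unnecessary contexts.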

In  Chapter  4,  two  models  were  proposed  for  generative  nonparametric  context 
learning.  These  models  were  learned  on  the  context  features  alone,  without  regard 
to  the  class  (target/clutter)  labels  or  the  target  features.  The  first  model  was  the 
DPGMM  [67],  which  was  able  to  learn  an  effective  number  of  Gaussian  contexts 
without  having  to  specify  the  number  of  contexts  a  priori.  The  second  model  was 
the  DPMFA  (adapted  from  [68,94])  which  lifted  the  restriction  of  having  all  contexts 
use  the  same  underlying  low-dimensional  projection.  The  DPMFA  was  used  to  learn
the  number  of  contexts  as  well  as  the  number  of  latent  factors  that  characterize  each. 
The  DPGMM  and  DPMFA  were  both  used  to  perform  context-dependent  fusion  and 
the  results  were  compared.  Context-dependent  fusion  using  either  model  performed 
significantly  better  than  a  single  RVM  incorporating  no  contextual  information,  but 
using  the  DPGMM  led  to  better  performance. 

Chapter  5  explored  discriminative  nonparametric  context  learning.  In  contrast  to 
generative  context  learning,  which  treated  clustering  in  the  context  features  and  discrimination
in  the  target  features  as  independent  tasks,  discriminative  context  learning
performed  both  tasks  jointly.  Two  methods  for  discriminative  context  learning
were  compared.  The  first,  referred  to  as  the  DPGMM-RVM,  consisted  of  a  mixture  of 
RVMs  with  a  DPGMM  gating  network.  The  DPGMM-RVM  was  shown  to  perform
clustering  in  context  space  while  also  training  the  classifiers  in  target  space.  The  other
method  was  the  IQGME,  which  was  originally  proposed  in  [94]  for  classification  with 
missing  data.  In  contrast  to  the  modified  DPGMM-RVM,  the  IQGME  performed
classification  and  clustering  in  the  joint  feature  space  formed  by  concatenating  the
context  and  target  features.  The  IQGME  also  did  not  utilize  sparse  component  classifiers.
A  series  of  synthetic  data  examples  compared  the  performance  of  the  two  discriminative
context  models  under  various  scenarios.  Furthermore,  their  performance
in  context-dependent  fusion  was  compared.  Although  both  showed  significant  improvement
over  the  single  RVM  at  many  points  on  the  ROC  curve,  performance  only
exceeded  that  of  generative  context  learning  at  low  PD  levels.  These  results  suggested
that  if  the  contextual  features  are  effective  in  generatively  clustering  according  to
relevant  contextual  factors,  discriminative  context  learning  may  be  unnecessary. 

The  idea  of  context  as  a  spatially-varying  property  was  explored  further  in  Chapter  6.
In  this  chapter,  contextual  features  were  extracted  from  the  background  at
regular  intervals.  This  feature  extraction  technique  was  referred  to  as  context  sampling.
By  sampling  context  throughout  all  space,  context  was  decoupled  from  the
anomalies  being  classified.  Instead,  context  was  learned  for  large  stretches  of  target-free
data  so  that  when  an  anomaly  was  encountered,  its  context  would  already  have
been  inferred.  Two  spatial  context  models  were  proposed.  The  first  was  based  on
the  DPGMM,  but  was  trained  on  features  extracted  through  context  sampling.  Although
the  DPGMM  was  trained  on  samples  collected  over  a  large  area,  in  the  statistical
sense  each  sample  was  treated  as  an  independent  observation.  Therefore,  a  context 
model  based  on  HMMs  was  also  considered  for  incorporating  the  spatial  dependency 
of  samples  into  inference. 

Spatially-dependent  context  models  have  a  physical  motivation,  since  many  contextual
factors  (such  as  soil  moisture)  may  be  localized  to  a  certain  area.  Therefore,
it  may  be  preferable  to  use  more  information  from  nearby  locations  when  inferring
the  context  of  the  present  location.  For  context  modeling,  the  SBHMM  [69]  was
used  as  a  nonparametric  extension  of  the  HMM  that  allowed  for  the  inference  of
the  effective  number  of  spatially-varying  states.  The  performance  of  the  DPGMM
and  SBHMM  in  spatial  context  modeling  was  compared,  and  it  was  shown  that  the
SBHMM  favored  sharp  transitions  between  different  contexts  while  the  DPGMM
favored  more  gradual  transitions.  With  regard  to  context-dependent  fusion  performance,
both  spatial  context  models  provided  additional  performance  improvements
over  the  alarm-based  techniques  used  in  previous  chapters.  However,  the  DPGMM 
appeared  to  perform  more  consistently  than  the  SBHMM  when  multiple  realizations 
of  the  models  were  compared. 

Finally,  in  Chapter  7,  the  context-dependent  classification  framework  originally 
developed  for  GPR  was  applied  to  buried  threat  detection  in  airborne  HSI  data. 
Although  the  same  statistical  framework  was  applicable  to  this  problem,  different 
contextual  factors  needed  to  be  exploited  because  the  phenomenology  of  HSI  differs
from  that  of  GPR.  Two  approaches  were  considered  for  extracting  contextual
information  from  HSI  data.  The  first  utilized  the  averaged  background  spectra  near
detected  anomalies,  which  were  indicative  of  the  relative  time  of  day  (morning,  afternoon,
or  night).  However,  it  was  also  important  to  consider  training  data  that
did  not  exhibit  such  drastic  temporal  differences.  Therefore,  the  second  approach
extracted  contextual  features  from  the  background  using  spectral  unmixing  to  yield
the  local  abundances  of  several  constituent  endmember  spectra.  Context-dependent
classification  was  performed  on  HSI  data  using  the  supervised,  generative  DPGMM,
and  discriminative  DPGMM-RVM  modeling  techniques.  For  both  sources  of  contextual
information,  performance  was  improved  over  a  conventional  linear  classifier.
However,  context  modeling  based  on  spectral  unmixing  led  to  greater  improvements 
in  performance. 

8.2  Considerations  for  Fielded  Systems


Although  the  models  and  algorithms  proposed  in  this  dissertation  were  designed 
with  fielded  application  (e.g.,  HMDS)  in  mind,  a  variety  of  factors  have  not  yet  been 
considered.  In  particular,  greater  attention  should  be  paid  to  improving  the  efficiency 
of  contextual  feature  extraction.  The  energy  and  reflection  coefficient  features  require 
only  simple  calculations,  but  the  process  of  extracting  the  matching  pursuits  and 
linear  prediction  features  must  be  improved  for  real-time  use.  The  efficiency  of 
matching  pursuits  can  be  improved  dramatically  by  careful  design  of  the  dictionary. 
This  can  include  restricting  the  number  of  elements  by  limiting  the  number  of  pulse 
locations  (in  time),  as  well  as  adjusting  the  width  of  the  pulses  to  better  reflect  the
pulses  that  make  up  a  GPR  A-scan.  In  this  work,  the  pulse  width  was  set  to  a  single 
value  that  generally  matched  the  width  of  the  transmitted  differentiated-Gaussian 
pulse.  However,  dispersion  effects  in  soil  propagation  are  inevitable,  and  the  width 
of  received  pulses  may  change  with  time.  Better  understanding  of  this  phenomenon 
may  allow  for  the  matching  pursuits  dictionary  to  be  designed  to  better  reflect  the
structure  of  GPR  A-scans. 

The  process  of  extracting  contextual  features  based  on  linear  prediction,  as  implemented
in  this  work,  involved  training  autoregressive  models  on  individual  segments
of  GPR  data.  In  the  case  of  alarm-based  context  learning,  these  segments  consisted
of  the  100  A-scans  collected  before  an  alarm.  In  spatial  context  modeling,  the  segments
were  the  100  A-scans  collected  before  the  current  background  sample,  and
therefore  much  of  the  data  used  to  compute  these  features  at  subsequent  samples 
was  redundant.  Because  linear  prediction  filtering  is  also  a  major  component  of  the 
HMDS  prescreening  algorithm  (see  [38]),  it  may  be  more  efficient  to  incorporate  the 
prescreener’s  internal  calculations  into  contextual  feature  extraction.  However,  the 
prescreener  was  treated  in  this  work  as  a  “black  box”  and  this  idea  was  not  explored. 
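
The autoregressive fit underlying the linear prediction features can be sketched with the Yule-Walker equations solved by the Levinson-Durbin recursion. This is one standard way to train an autoregressive model and is shown only as an assumption-laden illustration; it is not claimed to be the estimator used in this work or in the HMDS prescreener, and the function names and toy values are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Biased sample autocorrelation r[0..max_lag] of a sequence."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an AR model of the given order.

    Returns (a, err), where the one-step prediction is
    x_hat[t] = sum(a[k] * x[t - 1 - k] for k in range(order))
    and err is the resulting prediction-error power.
    """
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        if err <= 0.0:
            raise ValueError("prediction error vanished; lower the model order")
        # Reflection coefficient for extending the model from order m to m + 1.
        k_m = (r[m + 1] - sum(a[k] * r[m - k] for k in range(m))) / err
        new_a = a[:]
        new_a[m] = k_m
        for k in range(m):
            new_a[k] = a[k] - k_m * a[m - 1 - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a, err


# Autocorrelation of an ideal AR(1) process with coefficient 0.5 (toy values).
coeffs, pred_err = levinson_durbin([1.0, 0.5, 0.25], order=2)
r_small = autocorrelation([1.0, 2.0, 3.0], max_lag=1)
```

Because consecutive 100-A-scan segments overlap heavily, the autocorrelation sums could in principle be updated incrementally rather than recomputed, which is one place where the redundancy noted above might be exploited.
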

Aside  from  feature  extraction,  as  well  as  algorithm  training  (which  is  meant  to
be  performed  offline),  all  computations  involve  linear  operations  and/or  canonical
probability  density  functions  (see  Appendix  A).  Therefore,  if  the  efficiency  of
feature  extraction  can  be  improved,  context-dependent  classification  as  proposed  in
this  dissertation  should  be  implementable  for  real-time  processing  on  a  fielded  system.
Performance,  obviously,  is  dependent  on  sufficient  training.  The  data  used
to  train  the  algorithms  used  in  this  work  was  collected  in  2009  on  domestic  military
reservations,  which  present  operating  conditions  that  are  more  ideal  than  field
conditions.  Furthermore,  although  the  target  population  consisted  of  real  anti-tank
landmines  and  a  variety  of  simulated  IEDs,  it  is  a  limited  subset  of  the  actual  buried
explosive  threat.  Recall  from  Chapter  1  that  the  IED  threat  is  constantly  changing
and  adapting  to  countermeasures.  It  is  important  that  if  context-dependent  learning
(and  other  buried  threat  detection  algorithms)  were  to  be  deployed  in  a  fielded
system,  the  training  data  reflect  field  conditions  as  closely  as  possible. 

8.3  Future  Work 

Beyond  the  questions  of  how  to  improve  the  efficiency  of  context-dependent  fusion
for  fielded  GPR  systems,  several  unanswered  theoretical  questions  should  be  the  focus
of  future  work.  These  include  improving  learning  through
sampling  methods,  as  well  as  exploring  new  challenges  such  as  discriminative  spatial
context  learning,  comparing  context-dependent  learning  to  nonlinear  classification
models,  and  determining  whether  there  is  potential  for  online  Bayesian  context  learning
via  nonparametric  models.

Although context learning was performed using VB in this work, there is no reason
why Markov chain Monte Carlo (MCMC) techniques such as Gibbs sampling cannot
be used for approximate inference. MCMC is generally regarded as more accurate
than VB, since it draws samples from the posterior densities rather than iteratively
optimizing a lower bound from a randomized initial solution. MCMC is therefore
less susceptible to converging to a locally optimal solution, but this benefit comes
at the expense of greater computational cost. However, because context learning is
performed offline in this work, the greater computational cost of MCMC should not
be a limiting factor.

A question arising from the conclusions of Chapter 6 is whether a sequential
context model can be learned discriminatively. Just as the DPGMM-RVM was
used as a discriminative context model in Chapter 5, it may be possible to learn a
discriminative SBHMM-RVM context model. A hybrid HMM-HME was originally
proposed in [61] for speech recognition applications. Generalizing this model to
accommodate a nonparametric HMM (i.e., SBHMM) and a mixture of sparse classifiers
(i.e., RVMs) would be both academically interesting and practical for buried threat
detection and speech recognition alike.

In all performance comparisons, context-dependent classification was compared
to a single linear RVM that incorporated no contextual information. Comparisons to
nonlinear classifiers, such as a kernel RVM, using the combined target and contextual
features were never made. Although nonlinear classifiers may be competitive with
alarm-based context-dependent learning, they would achieve this by including context
as additional features of an observation rather than as the state of the world at a
given location. The comparisons made to IQGME in Chapter 5 address this issue,
illustrating the advantages and disadvantages of performing context learning in
separate feature spaces for alarm-based classification. However, it would be difficult
to adapt IQGME or a nonlinear classifier to utilize contextual information in the same
fashion as the spatial context models presented in Chapter 6. By treating context
as a property of a continuously varying environment, and not a property of discrete
observations within that environment, context-dependent learning satisfies our
intuition in a way that nonlinear classification does not. Future work should consider
a series of synthetic data experiments, and other real-world examples outside of buried
threat detection, to illustrate this important difference between context-dependent
learning and standard nonlinear classification.

Future investigations must also consider what to do when new contexts are
encountered in the field. This is a legitimate question, since the fielded algorithm
would be trained on domestic data collected under somewhat idealized conditions.
If a system using context-dependent classification as proposed in this work were to
enter a previously unseen context, the contextual features extracted from the data
would appear to be statistical outliers. Although the likelihood of being in any one
of the known contexts would be very small, the differences in likelihood between
the contexts could span orders of magnitude (e.g., $10^{-4}$ versus $10^{-6}$), and
posterior inference would confidently favor the context with the greater likelihood.
This behavior could be modified by imposing a likelihood threshold: if the likelihoods
of all contexts fall below it, each context would be treated as equally unlikely. The
system would then behave in these conditions as if it were incorporating no contextual
information at all.
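
The thresholding idea described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
`context_posterior`, the default threshold value, and the choice of a uniform
fallback are hypothetical and were not part of the algorithms evaluated in this work.

```python
def context_posterior(likelihoods, priors, threshold=1e-3):
    """Posterior over known contexts with an outlier fallback.

    likelihoods: p(contextual features | context k) for each known context
    priors:      p(context k)
    threshold:   hypothetical cutoff below which a likelihood is "too small"
    """
    if max(likelihoods) < threshold:
        # Every known context explains the data poorly, so treat all of
        # them as equally unlikely (assumed here to mean uniform weights).
        return [1.0 / len(likelihoods)] * len(likelihoods)
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


# Without a threshold, 1e-4 versus 1e-6 gives a near-certain posterior.
print(context_posterior([1e-4, 1e-6], [0.5, 0.5], threshold=0.0))
# With the threshold, the same likelihoods fall back to equal weights.
print(context_posterior([1e-4, 1e-6], [0.5, 0.5], threshold=1e-3))
```

With equal weights on every context, a context-dependent fusion rule reduces to a
simple average of the context-specific classifiers, which is one way of behaving
"as if" no contextual information were available.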

However, online context learning may be a more attractive option for dealing
with newly encountered environments. This type of learning may be supported by
the Dirichlet process. Recall the Chinese restaurant process: as more customers enter
the restaurant, tables that were once empty fill up as time progresses. Therefore,
as more data are collected in the field, new context distributions can potentially be
learned. To make online context learning viable, VB must be used to conserve
processing resources. However, online VB for nonparametric models is still an active
area of research [138,139]. It should also be noted that the Chinese restaurant
process does not support customers moving between tables. This suggests that
although the DP may be a useful prior for learning new contexts as they are
encountered, it may present difficulties in forgetting contexts that are never seen
in the field.
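
The growth behavior of the Chinese restaurant process referred to above can be
illustrated with a short simulation. The sketch below is a generic textbook
construction, not code used in this work; the function name, the concentration
parameter `alpha`, and the fixed seed are assumptions made for illustration.

```python
import random


def chinese_restaurant_process(num_customers, alpha, seed=0):
    """Simulate table assignments under a CRP with concentration alpha.

    Returns the final table occupancies and, for each arrival, the number
    of occupied tables (i.e., the number of contexts seen so far).
    """
    rng = random.Random(seed)
    table_counts = []
    history = []
    for n in range(num_customers):
        # Customer n joins existing table k with probability count_k / (n + alpha)
        # and opens a new table with probability alpha / (n + alpha).
        r = rng.uniform(0.0, n + alpha)
        cumulative = 0.0
        for k, count in enumerate(table_counts):
            cumulative += count
            if r < cumulative:
                table_counts[k] += 1
                break
        else:
            table_counts.append(1)
        history.append(len(table_counts))
    return table_counts, history


counts, history = chinese_restaurant_process(500, alpha=2.0)
print("occupied tables after 10, 100, 500 customers:",
      history[9], history[99], history[499])
```

Note that customers are only ever added to tables and never removed, which mirrors
the difficulty of forgetting unused contexts mentioned above.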


8.4  Broader  Applications 

Context-dependent  classification  as  proposed  in  this  dissertation  may  have  broader 
application  to  areas  outside  of  GPR  and  HSI  sensing.  The  most  general  implication 
of  this  work  is  the  notion  that  contextual  factors  can  be  embraced,  rather  than 
mitigated,  to  improve  performance.  This  concept  can  be  applied  to  a  variety  of 
statistical  learning  applications  in  the  sensing  field  and  beyond. 

An example of another sensing technology that may benefit from context-dependent
learning is laser-induced breakdown spectroscopy (LIBS), which has shown potential
for use in "fingerprinting" different chemical compounds [140,141]. LIBS operates by
focusing a high-powered laser onto a material to form a plasma. The plasma emits
a distinctive spectrum that characterizes the material's chemical composition. In
applications of LIBS to classifying chemical, biological, radiological, nuclear,
and explosive (CBRNE) residues, plasma will also be formed from the background
substrate, resulting in the background spectrum being mixed with the spectrum of
interest. Preliminary studies based on the work presented in this dissertation have
suggested that residues could be better classified by LIBS if the spectrum of the
background is correctly identified first, suggesting that the fieldability of LIBS
sensors may be improved by embracing a context-dependent treatment of the
classification problem [142].

Another area that may benefit from context-dependent classification is
neurological prostheses, such as brain-computer interfaces (BCI). Systems such as the
P300 speller exploit features of electroencephalogram (EEG) signals, recorded by an
electrode cap worn by an amyotrophic lateral sclerosis (ALS) patient, to select
characters as they become highlighted on a computer display [143]. A problem currently
being investigated for BCI is channel selection, i.e. determining which electrodes are
most informative for identifying when the correct character was selected [144]. Just
as a priori knowledge of GPR and HSI phenomenology was leveraged as a source of
contextual information for improving detection of buried objects in that data,
neuroscience may be a source of contextual information for performing channel selection
in BCI. By anticipating which section of the brain would yield the most informative
response, character identification performance and overall system throughput can
potentially be improved.

Finally, context-dependent learning may someday find applications in the broader
area of statistical data mining. Many experts have noted that society is entering
the age of big data, and virtually all industries are demanding intelligent
processing solutions to facilitate decision-making [145]. This problem has been brought
to mainstream attention over recent years through a highly publicized data mining
competition spearheaded by Netflix, which sought to improve its movie
recommendation algorithm [146]. As more customer data becomes available through Internet
transactions and social networks, online service providers will have more consumer
information available to them than ever before. For example, in recommending
music to listeners, leading algorithms identify the genre of a recording solely based on
frequency-domain features of the audio signal (e.g., [113,147]). However, valuable
contextual information is also available through associated metadata (artist
biography, lyrics, subject matter) as well as in complementary media such as books or
films that the customer also enjoys. The consumption of media by a user's friends
may also be a source of contextual information in making a recommendation to a
particular user.

In conclusion, significant contributions to the remote sensing and machine
learning fields were made through this work. Several Bayesian techniques for context
learning were proposed, and they were shown to provide useful information to
improve the performance of GPR and HSI systems used for detecting landmines and
IEDs. Success was achieved in further bridging the fields of physics and statistics.
It was illustrated that the statistics of data obtained through physical phenomena
that are often overlooked can, in fact, be leveraged in making decisions. The results
presented throughout this dissertation demonstrated that by thinking "out of the
box", and approaching a problem from a different angle than previous researchers,
fielded technologies can continue to be improved upon.


Appendix A

Probability  Distributions 


The  probability  distributions  for  all  random  variables  considered  in  this  dissertation 
are  presented  in  this  appendix.  For  each  distribution,  the  functional  form  of  the 
PDF,  descriptions  of  the  parameters,  and  moments  necessary  for  all  calculations  are 
provided.  The  written  formats  of  the  PDFs  are  based  on  Bishop’s  text  [71]. 

A.1 Bernoulli Distribution

The Bernoulli distribution is for a single binary variable, $x \in \{0, 1\}$, representing
either a positive or null outcome of an experiment. The random variable, $x$, is
denoted as Bernoulli-distributed by

x \sim \mathrm{Bernoulli}(x \mid \theta). (A.1)

A.1.1 Parameters

The parameter of the Bernoulli distribution is $\theta$, such that

\theta = p(x = 1). (A.2)

A.1.2 Probability Density Function

The density function for the Bernoulli distribution is

p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}. (A.3)

A.1.3 Moments

The mean and variance of the Bernoulli distribution are

\mathrm{E}[x] = \theta, (A.4)

\mathrm{Var}[x] = \theta (1 - \theta). (A.5)

A.1.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two Bernoulli distributions is

\mathrm{KLD}[q(x \mid \theta_q) \,\|\, p(x \mid \theta_p)] = \theta_q \log\frac{\theta_q}{\theta_p} + (1 - \theta_q) \log\frac{1 - \theta_q}{1 - \theta_p}. (A.6)
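
As a quick numerical illustration of (A.6), the following minimal Python sketch
evaluates the Bernoulli KLD. The function name `kld_bernoulli` is hypothetical, and
the sketch assumes that $\theta_p$ lies strictly between zero and one so that the
logarithms are finite.

```python
import math


def kld_bernoulli(theta_q, theta_p):
    """KLD between Bernoulli(theta_q) and Bernoulli(theta_p), following (A.6)."""
    def term(a, b):
        # Uses the convention 0 * log(0 / b) = 0.
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(theta_q, theta_p) + term(1.0 - theta_q, 1.0 - theta_p)


print(kld_bernoulli(0.5, 0.25))   # divergence of q from p
print(kld_bernoulli(0.25, 0.5))   # differs from the line above: the KLD is asymmetric
```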

A.2 Binomial Distribution

The binomial distribution gives the probability of observing $x$ positive Bernoulli trials
in $N$ experiments. The random variable, $x$, is denoted as binomial-distributed by

x \sim \mathrm{Binomial}(x \mid N, \theta). (A.7)

A.2.1 Parameters

Like the Bernoulli distribution, the parameter of the binomial distribution is $\theta$, such
that

0 \le \theta \le 1. (A.8)

A.2.2 Probability Density Function

The density function for the binomial distribution is

p(x \mid N, \theta) = \binom{N}{x} \theta^{x} (1 - \theta)^{N - x}. (A.9)

A.2.3 Moments

The mean and variance of the binomial distribution are

\mathrm{E}[x] = N\theta, (A.10)

\mathrm{Var}[x] = N\theta (1 - \theta). (A.11)

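A.2.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two binomial distributions is

\mathrm{KLD}[q(x \mid N_q, \theta_q) \,\|\, p(x \mid N_p, \theta_p)] = \sum_{i=0}^{N_q} \binom{N_q}{i} \theta_q^{i} (1 - \theta_q)^{N_q - i} \log\frac{\binom{N_q}{i} \theta_q^{i} (1 - \theta_q)^{N_q - i}}{\binom{N_p}{i} \theta_p^{i} (1 - \theta_p)^{N_p - i}}. (A.12)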


A.3 Multinomial Distribution

The multinomial distribution is a multivariate generalization of the Bernoulli
distribution to a D-dimensional binary variable $\mathbf{x}$, with elements $x_d \in \{0, 1\}$
constrained to sum to unity, i.e. $\sum_d x_d = 1$. The random vector, $\mathbf{x}$, is denoted as
multinomial-distributed by

\mathbf{x} \sim \mathrm{Multinomial}(\mathbf{x} \mid \boldsymbol{\theta}). (A.13)

A.3.1 Parameters

The parameter of the multinomial distribution is the probability vector $\boldsymbol{\theta}$, whose
elements must satisfy the following:

0 \le \theta_d \le 1, \quad d = 1, 2, \ldots, D, (A.14)

\sum_{d=1}^{D} \theta_d = 1. (A.15)

A.3.2 Probability Density Function

The density function for the multinomial distribution is

p(\mathbf{x} \mid \boldsymbol{\theta}) = \prod_{d=1}^{D} \theta_d^{x_d}. (A.16)

A.3.3 Moments

The mean and variance of the multinomial distribution are:

\mathrm{E}[x_d] = \theta_d, (A.17)

\mathrm{Var}[x_d] = \theta_d (1 - \theta_d). (A.18)

A.3.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two multinomial distributions is:

\mathrm{KLD}[q(\mathbf{x} \mid \boldsymbol{\theta}_q) \,\|\, p(\mathbf{x} \mid \boldsymbol{\theta}_p)] = \sum_{d=1}^{D} \theta_{qd} \log\frac{\theta_{qd}}{\theta_{pd}}. (A.19)


A.4 Beta Distribution

The Beta distribution is for the continuous variable, $x \in [0, 1]$. Since the Beta
distribution has finite support between zero and one, it is often used to represent
uncertainty in the probability of an event. The random variable, $x$, is denoted as
Beta-distributed by

x \sim \mathrm{Beta}(x \mid a, b). (A.20)

A.4.1 Parameters

The parameters of the Beta distribution are $a$ and $b$, such that

a > 0, (A.21)

b > 0. (A.22)

A.4.2 Probability Density Function

The density function for the Beta distribution is

p(x \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)} x^{a - 1} (1 - x)^{b - 1}, (A.23)

where $\Gamma(\cdot)$ denotes the Gamma function given by

\Gamma(z) = \int_{0}^{\infty} e^{-t} t^{z - 1} \, dt. (A.24)

A.4.3 Moments

The mean and variance of the Beta distribution are:

\mathrm{E}[x] = \frac{a}{a + b}, (A.25)

\mathrm{Var}[x] = \frac{ab}{(a + b)^2 (a + b + 1)}. (A.26)

A.4.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two Beta distributions is

\mathrm{KLD}[q(x \mid a_q, b_q) \,\|\, p(x \mid a_p, b_p)] = \log\frac{\Gamma(a_q + b_q)}{\Gamma(a_p + b_p)} + \log\frac{\Gamma(a_p)}{\Gamma(a_q)} + \log\frac{\Gamma(b_p)}{\Gamma(b_q)}
    + [a_q - a_p] [\psi(a_q) - \psi(a_q + b_q)]
    + [b_q - b_p] [\psi(b_q) - \psi(a_q + b_q)]. (A.27)
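A.5 Dirichlet Distribution

The Dirichlet distribution is a multivariate extension of the Beta distribution for a
D-dimensional vector, $\mathbf{x}$. The random vector, $\mathbf{x}$, is denoted as Dirichlet-distributed by

\mathbf{x} \sim \mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}). (A.28)

A.5.1 Parameters

The parameters of the Dirichlet distribution are the elements of the vector $\boldsymbol{\alpha}$, each
of which must be positive, $\alpha_d > 0$. The elements of the random vector $\mathbf{x}$ lie on the
probability simplex, such that

0 \le x_d \le 1, \quad d = 1, 2, \ldots, D, (A.29)

\sum_{d=1}^{D} x_d = 1. (A.30)

A.5.2 Probability Density Function

The density function for the Dirichlet distribution is

p(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{\Gamma\left(\sum_{d=1}^{D} \alpha_d\right)}{\prod_{d=1}^{D} \Gamma(\alpha_d)} \prod_{d=1}^{D} x_d^{\alpha_d - 1}. (A.31)

A.5.3 Moments

The mean and variance of the Dirichlet distribution are

\mathrm{E}[x_d] = \frac{\alpha_d}{\sum_{k=1}^{D} \alpha_k}, (A.32)

\mathrm{Var}[x_d] = \frac{\alpha_d (\alpha_0 - \alpha_d)}{\alpha_0^2 (\alpha_0 + 1)}, \quad \alpha_0 = \sum_{k=1}^{D} \alpha_k. (A.33)

A.5.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence between two Dirichlet distributions is

\mathrm{KLD}[q(\mathbf{x} \mid \boldsymbol{\alpha}_q) \,\|\, p(\mathbf{x} \mid \boldsymbol{\alpha}_p)] = \log\frac{\Gamma\left(\sum_{d=1}^{D} \alpha_{qd}\right)}{\Gamma\left(\sum_{d=1}^{D} \alpha_{pd}\right)} + \sum_{d=1}^{D} \log\frac{\Gamma(\alpha_{pd})}{\Gamma(\alpha_{qd})}
    + \sum_{d=1}^{D} (\alpha_{qd} - \alpha_{pd}) \left[\psi(\alpha_{qd}) - \psi\left(\sum_{k=1}^{D} \alpha_{qk}\right)\right]. (A.34)

A.6 Gamma Distribution
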


The Gamma distribution is over a positive random variable, $x > 0$, governed by
two positive parameters to ensure proper normalization. The random variable, $x$, is
denoted as Gamma-distributed by

x \sim \mathrm{Gamma}(x \mid a, b). (A.35)

A.6.1 Parameters

The parameters of the Gamma distribution are $a$ and $b$, such that

a > 0, (A.36)

b > 0. (A.37)

A.6.2 Probability Density Function

The density function for the Gamma distribution is

p(x \mid a, b) = \frac{1}{\Gamma(a)} b^{a} x^{a - 1} e^{-bx}, (A.38)

where $\Gamma(\cdot)$ denotes the Gamma function given by (A.24).

A.6.3 Moments

The mean and variance of the Gamma distribution are:

\mathrm{E}[x] = \frac{a}{b}, (A.39)

\mathrm{Var}[x] = \frac{a}{b^2}. (A.40)

A.6.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two Gamma distributions is

\mathrm{KLD}[q(x \mid a_q, b_q) \,\|\, p(x \mid a_p, b_p)] = (a_q - 1)\psi(a_q) + \log b_q - a_q - \log\Gamma(a_q) + \log\Gamma(a_p)
    - a_p \log b_p - (a_p - 1) [\psi(a_q) - \log b_q] + \frac{a_q b_p}{b_q}, (A.41)

where $\psi(\cdot)$ is the digamma function defined by

\psi(x) = \frac{d}{dx} \log\Gamma(x). (A.42)

A.7 Normal (Gaussian) Distribution



The Normal distribution of the continuous variable, $x$, has infinite support and is
governed by the mean and variance parameters. The random variable, $x$, is denoted
as Normally-distributed by

x \sim \mathcal{N}(x \mid \mu, \sigma). (A.43)

A.7.1 Parameters

The parameters of the Normal distribution are $\mu$ and $\sigma$, such that

-\infty < \mu < \infty, (A.44)

\sigma > 0. (A.45)

A.7.2 Probability Density Function

The density function for the Normal distribution is

p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right). (A.46)

A.7.3 Moments

\mathrm{E}[x] = \mu, (A.47)

\mathrm{Var}[x] = \sigma^2. (A.48)

A.7.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two Normal distributions is

\mathrm{KLD}[q(x \mid \mu_q, \sigma_q) \,\|\, p(x \mid \mu_p, \sigma_p)] = \frac{1}{2} \log\frac{\sigma_p^2}{\sigma_q^2} + \frac{\mu_q^2 + \mu_p^2 + \sigma_q^2 - 2\mu_q\mu_p}{2\sigma_p^2} - \frac{1}{2}. (A.49)
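
The closed form in (A.49) is simple to evaluate numerically. The sketch below is a
minimal illustration, with a hypothetical function name, in which the arguments
`sigma_q` and `sigma_p` are assumed to be standard deviations (so that the variance
is the square of each).

```python
import math


def kld_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KLD between N(mu_q, sigma_q^2) and N(mu_p, sigma_p^2), following (A.49)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


# Identical distributions have zero divergence; shifting the mean of p by
# one standard deviation costs one half of a nat.
print(kld_normal(0.0, 1.0, 0.0, 1.0))
print(kld_normal(0.0, 1.0, 1.0, 1.0))
```

Here the numerator $\mu_q^2 + \mu_p^2 - 2\mu_q\mu_p$ from (A.49) has been written as
$(\mu_q - \mu_p)^2$, and $\frac{1}{2}\log(\sigma_p^2/\sigma_q^2)$ as $\log(\sigma_p/\sigma_q)$.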


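A.8 Multivariate Normal (Gaussian) Distribution

The multivariate extension of the Normal distribution is over the D-dimensional
random vector, $\mathbf{x}$, whose elements $x_d \in (-\infty, \infty)$ for $d = 1, 2, \ldots, D$. The distribution
is governed by the mean vector and covariance matrix. The random vector, $\mathbf{x}$, is
denoted as Normally-distributed by

\mathbf{x} \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}). (A.50)

A.8.1 Parameters

The parameters of the Normal distribution are $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, such that

-\infty < \mu_d < \infty, \quad d = 1, 2, \ldots, D, (A.51)

\boldsymbol{\Sigma} is positive-definite. (A.52)

A.8.2 Probability Density Function

The density function for the multivariate Normal distribution is

p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right). (A.53)

A.8.3 Moments

The mean and covariance elements are:

\mathrm{E}[\mathbf{x}] = \boldsymbol{\mu}, (A.54)

\mathrm{E}[x_d x_k] = \mu_d \mu_k + \Sigma_{dk}, (A.55)

\mathrm{Cov}[\mathbf{x}] = \boldsymbol{\Sigma}. (A.56)

A.8.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two multivariate Normal distributions is

\mathrm{KLD}[q(\mathbf{x} \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q) \,\|\, p(\mathbf{x} \mid \boldsymbol{\mu}_p, \boldsymbol{\Sigma}_p)] = \frac{1}{2} \log\frac{|\boldsymbol{\Sigma}_p|}{|\boldsymbol{\Sigma}_q|} + \frac{1}{2} \mathrm{Tr}\left[\boldsymbol{\Sigma}_p^{-1} \boldsymbol{\Sigma}_q\right]
    + \frac{1}{2} (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p)^T \boldsymbol{\Sigma}_p^{-1} (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p) - \frac{D}{2}. (A.57)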


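A.9 Wishart Distribution

The Wishart distribution is over the $D \times D$ matrix $\boldsymbol{\Lambda}$, and is the conjugate prior for
the precision (inverse covariance) matrix of a multivariate Normal distribution. The
random matrix, $\boldsymbol{\Lambda}$, is denoted as Wishart-distributed by

\boldsymbol{\Lambda} \sim \mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu). (A.58)

A.9.1 Parameters

The parameters of the Wishart distribution are the degrees of freedom, $\nu$, and the
scale matrix, $\mathbf{W}$, which must satisfy the following:

\nu > D - 1, (A.59)

\mathbf{W} is positive definite. (A.60)

A.9.2 Probability Density Function

The density function for the Wishart distribution is

p(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu) = B(\mathbf{W}, \nu) \, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\left(-\frac{1}{2} \mathrm{Tr}\left(\mathbf{W}^{-1} \boldsymbol{\Lambda}\right)\right), (A.61)

where

B(\mathbf{W}, \nu) = |\mathbf{W}|^{-\nu/2} \left(2^{\nu D/2} \pi^{D(D-1)/4} \prod_{d=1}^{D} \Gamma\left(\frac{\nu + 1 - d}{2}\right)\right)^{-1}. (A.62)

A.9.3 Moments

The expected values of $\boldsymbol{\Lambda}$ and $\log|\boldsymbol{\Lambda}|$ are

\mathrm{E}[\boldsymbol{\Lambda}] = \nu \mathbf{W}, (A.63)

\mathrm{E}[\log|\boldsymbol{\Lambda}|] = \sum_{d=1}^{D} \psi\left(\frac{\nu + 1 - d}{2}\right) + D \log 2 + \log|\mathbf{W}|, (A.64)

where $\psi(\cdot)$ is the digamma function defined by (A.42).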
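
A.9.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence (KLD) between two Wishart distributions is

\mathrm{KLD}[q(\boldsymbol{\Lambda} \mid \mathbf{W}_q, \nu_q) \,\|\, p(\boldsymbol{\Lambda} \mid \mathbf{W}_p, \nu_p)]
    = \frac{\nu_q - D - 1}{2} \left[\sum_{d=1}^{D} \psi\left(\frac{\nu_q - d + 1}{2}\right) + D \log 2 + \log|\mathbf{W}_q|\right]
    - \frac{\nu_p - D - 1}{2} \left[\sum_{d=1}^{D} \psi\left(\frac{\nu_q - d + 1}{2}\right) + D \log 2 + \log|\mathbf{W}_q|\right]
    - \frac{\nu_q D}{2} + \frac{\nu_q}{2} \mathrm{Tr}\left(\mathbf{W}_p^{-1} \mathbf{W}_q\right) + \log\frac{B(\mathbf{W}_q, \nu_q)}{B(\mathbf{W}_p, \nu_p)}, (A.65)

where $B(\mathbf{W}, \nu)$ is defined in (A.62), and both bracketed terms are the expectation
$\mathrm{E}[\log|\boldsymbol{\Lambda}|]$ of (A.64) taken under $q$.

A.10 Normal-Wishart Distribution
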


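The Normal-Wishart distribution is a joint density over the $D \times 1$ vector, $\mathbf{x}$, and the
$D \times D$ matrix, $\boldsymbol{\Lambda}$. It is the conjugate prior for a multivariate Normal distribution with
unknown mean and precision (inverse covariance) matrix. The random variables,
$(\mathbf{x}, \boldsymbol{\Lambda})$, are denoted as Normal-Wishart distributed by

(\mathbf{x}, \boldsymbol{\Lambda}) \sim \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, u^{-1} \boldsymbol{\Lambda}^{-1}) \, \mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu). (A.66)

A.10.1 Parameters

The parameters of the Normal-Wishart distribution involve many of the same
parameters of the Normal and Wishart distributions. They include the location (mean),
$\boldsymbol{\mu}$; a precision scale, $u$; the degrees of freedom, $\nu$; and the scale matrix, $\mathbf{W}$, which
must satisfy the following:

-\infty < \mu_d < \infty, \quad d = 1, 2, \ldots, D, (A.67)

u > 0, (A.68)

\nu > D - 1, (A.69)

\mathbf{W} is positive definite. (A.70)

A.10.2 Probability Density Function

The density function for the Normal-Wishart distribution is obtained by
multiplication of the Normal and Wishart density functions, which yields

p(\mathbf{x}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}, u, \mathbf{W}, \nu) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, u^{-1} \boldsymbol{\Lambda}^{-1}) \, \mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}, \nu)
    = B(\mathbf{W}, \nu) \, (2\pi)^{-D/2} \, |u\boldsymbol{\Lambda}|^{1/2} \, |\boldsymbol{\Lambda}|^{(\nu - D - 1)/2} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T u\boldsymbol{\Lambda} (\mathbf{x} - \boldsymbol{\mu}) - \frac{1}{2} \mathrm{Tr}\left(\mathbf{W}^{-1} \boldsymbol{\Lambda}\right)\right), (A.71)

where $B(\mathbf{W}, \nu)$ is defined in (A.62).

A.10.3 Moments

The expected values of $\mathbf{x}$ and $\boldsymbol{\Lambda}$ follow the Normal and Wishart distributions:

\mathrm{E}[\mathbf{x} \mid u^{-1} \boldsymbol{\Lambda}^{-1}] = \boldsymbol{\mu}, (A.72)

\mathrm{E}[\boldsymbol{\Lambda}] = \nu \mathbf{W}, (A.73)

\mathrm{E}[\log|\boldsymbol{\Lambda}|] = \sum_{d=1}^{D} \psi\left(\frac{\nu + 1 - d}{2}\right) + D \log 2 + \log|\mathbf{W}|, (A.74)

where $\psi(\cdot)$ is the digamma function defined by (A.42).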

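A.10.4 Kullback-Leibler Divergence

The Kullback-Leibler Divergence between two Normal-Wishart distributions is

\mathrm{KLD}[q(\mathbf{x}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_q, u_q, \mathbf{W}_q, \nu_q) \,\|\, p(\mathbf{x}, \boldsymbol{\Lambda} \mid \boldsymbol{\mu}_p, u_p, \mathbf{W}_p, \nu_p)]
    = \frac{D}{2} \left(\log\frac{u_q}{u_p} + \frac{u_p}{u_q} - 1\right) + \frac{u_p \nu_q}{2} (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p)^T \mathbf{W}_q (\boldsymbol{\mu}_q - \boldsymbol{\mu}_p)
    + \mathrm{KLD}[\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}_q, \nu_q) \,\|\, \mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}_p, \nu_p)], (A.75)

where $\mathrm{KLD}[\mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}_q, \nu_q) \,\|\, \mathcal{W}(\boldsymbol{\Lambda} \mid \mathbf{W}_p, \nu_p)]$ is a Kullback-Leibler divergence between
two Wishart distributions.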


Appendix B


Relevance  Vector  Machines 


The  relevance  vector  machine  (RVM),  originally  proposed  by  Tipping  [83],  was  used 
in  this  work  as  a  statistical  model  for  classification  and  regression.  In  this  appendix, 
the variational Bayesian update equations for a single RVM regressor/classifier [84]
and  a  mixture  of  RVM  classifiers  are  presented. 

The  RVM  is  a  sparseness-promoting  technique  for  Bayesian  inference  of  regression 
and  classification  models.  Like  support  vector  machines  (SVMs)  [92],  RVMs  seek  a 
sparse  weighting  of  kernel-transformed  features.  While  the  SVM  accomplishes  this 
by  maximizing  the  margin  between  classes,  the  RVM  utilizes  sparseness-promoting 
priors.  The  overall  effect  is  a  model  that  does  not  require  tuning  (due  to  the  use  of 
noninformative  priors)  and  has  different  sparseness  properties. 

B.1 RVM Regression

B.1.1 Generative Model and Variable Definitions

y_n = \mathbf{w}^T \phi(\mathbf{x}_n) (B.1)

(t_n \mid \mathbf{w}, \mathbf{x}_n) \sim \mathcal{N}(t_n \mid y_n, \tau^{-1}) (B.2)

$n = 1, 2, \ldots, N$ is the observation index

$d = 1, 2, \ldots, D$ is the dimension index

$\mathbf{x}_n$ is the feature vector of observation $n$

$\phi(\mathbf{x}_n)$ is a D-dimensional kernel transformation of $\mathbf{x}_n$

$\mathbf{w}$ is a D-dimensional weight vector

$y_n$ is the model output for observation $n$

$t_n$ is the target value for observation $n$

$\tau$ is the precision of $t$

B.1.2 Priors

\mathbf{w} \sim \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \mathbf{A}^{-1}), where \mathbf{A} = \mathrm{diag}(\boldsymbol{\alpha}) (B.3)

\alpha_d \sim \mathrm{Gamma}(\alpha_d \mid a_0, b_0) (B.4)

\tau \sim \mathrm{Gamma}(\tau \mid c_0, d_0) (B.5)

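B.1.3 Variational Posterior on w

It was shown earlier in this dissertation that the NFE is maximized by a variational
posterior, $q(\mathbf{w})$, that is proportional to the variational expectation of the true
log-posterior with respect to all other model parameters:

q(\mathbf{w}) \propto \langle \log p(\mathbf{w} \mid -) \rangle (B.6)

The true log-posterior may be calculated from Bayes' theorem:

\log p(\mathbf{w} \mid -) = \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) - K, (B.7)

where $K$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $\mathbf{w}$, then taking the variational
expectation $\langle \cdot \rangle$:

\log p(\mathbf{w} \mid -) = -\frac{1}{2} \left[\tau \sum_{n=1}^{N} (t_n - y_n)^2 + \mathbf{w}^T \mathbf{A} \mathbf{w}\right] - K
    = -\frac{1}{2} \left[\tau \sum_{n=1}^{N} t_n^2 - 2\mathbf{w}^T \tau \sum_{n=1}^{N} t_n \phi(\mathbf{x}_n) + \mathbf{w}^T \tau \sum_{n=1}^{N} \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \mathbf{w} + \mathbf{w}^T \mathbf{A} \mathbf{w}\right] - K
    = -\frac{1}{2} \left[-2\mathbf{w}^T \tau \sum_{n=1}^{N} t_n \phi(\mathbf{x}_n) + \mathbf{w}^T \left(\mathbf{A} + \tau \sum_{n=1}^{N} \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T\right) \mathbf{w}\right] - K (B.8)

In the last line, the term $\tau \sum_n t_n^2$ does not depend on $\mathbf{w}$ and has been absorbed
into $K$. Completing the square reveals that $\mathbf{w}$ is Gaussian:

\log p(\mathbf{w} \mid -) = \log \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma}) (B.9)

where

\mathbf{m} = \tau \boldsymbol{\Sigma} \sum_{n=1}^{N} t_n \phi(\mathbf{x}_n), (B.10)

\boldsymbol{\Sigma} = \left(\mathbf{A} + \tau \sum_{n=1}^{N} \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T\right)^{-1}. (B.11)

In the VB update, $\tau$ and $\mathbf{A}$ are replaced by their variational expectations,
$\langle \tau \rangle$ and $\langle \mathbf{A} \rangle = \mathrm{diag}(\langle \boldsymbol{\alpha} \rangle)$.

Useful moments in VB updates for other model parameters:

\langle \mathbf{w} \rangle = \mathbf{m} (B.12)

\langle \mathbf{w}\mathbf{w}^T \rangle = \mathbf{m}\mathbf{m}^T + \boldsymbol{\Sigma} (B.13)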


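B.1.4 Variational Posterior on α

It was shown earlier in this dissertation that the NFE is maximized by a variational
posterior, $q(\boldsymbol{\alpha})$, that is proportional to the variational expectation of the true
log-posterior with respect to all other model parameters:

q(\boldsymbol{\alpha}) \propto \langle \log p(\boldsymbol{\alpha} \mid -) \rangle (B.14)

The true posterior may be calculated from Bayes' theorem:

p(\boldsymbol{\alpha} \mid -) \propto p(\mathbf{w} \mid \boldsymbol{\alpha}) \, p(\boldsymbol{\alpha}) (B.15)

The variational posterior can be calculated by solving the true posterior as a function
of $\boldsymbol{\alpha}$, then taking the variational expectation $\langle \cdot \rangle$:

p(\boldsymbol{\alpha} \mid -) \propto \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \alpha_d^{-1}) \, \mathrm{Gamma}(\alpha_d \mid a_0, b_0)
    \propto \prod_{d=1}^{D} \alpha_d^{1/2} \exp\left(-\frac{1}{2} \alpha_d w_d^2\right) \alpha_d^{a_0 - 1} \exp(-b_0 \alpha_d)
    \propto \prod_{d=1}^{D} \alpha_d^{a_0 + \frac{1}{2} - 1} \exp\left(-\left(b_0 + \frac{1}{2} w_d^2\right) \alpha_d\right) (B.16)

Therefore, the $\alpha$'s are Gamma distributed:

p(\boldsymbol{\alpha} \mid -) = \prod_{d=1}^{D} \mathrm{Gamma}(\alpha_d \mid a_d, b_d), (B.17)

where

a_d = a_0 + \frac{1}{2}, (B.18)

b_d = b_0 + \frac{1}{2} w_d^2. (B.19)

In the VB update, $w_d^2$ is replaced by its variational expectation, $\langle w_d^2 \rangle$, which is
the $d$-th diagonal element of $\langle \mathbf{w}\mathbf{w}^T \rangle$ in (B.13).

Useful moments in VB updates for other model parameters:

\langle \alpha_d \rangle = \frac{a_d}{b_d} (B.20)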
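
B.1.5 Variational Posterior on τ

It was shown earlier in this dissertation that the NFE is maximized by a variational
posterior, $q(\tau)$, that is proportional to the variational expectation of the true
log-posterior with respect to all other model parameters:

q(\tau) \propto \langle \log p(\tau \mid -) \rangle (B.21)

The true posterior may be calculated from Bayes' theorem:

p(\tau \mid -) \propto p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \tau) \, p(\tau) (B.22)

The variational posterior can be calculated by solving the true posterior as a function
of $\tau$, then taking the variational expectation $\langle \cdot \rangle$:

p(\tau \mid -) \propto \prod_{n=1}^{N} \tau^{1/2} \exp\left(-\frac{\tau}{2} (t_n - y_n)^2\right) \tau^{c_0 - 1} \exp(-d_0 \tau)
    \propto \tau^{N/2} \exp\left(-\frac{\tau}{2} \sum_{n=1}^{N} (t_n - y_n)^2\right) \tau^{c_0 - 1} \exp(-d_0 \tau)
    \propto \exp\left(-\tau \left[d_0 + \frac{1}{2} \sum_{n=1}^{N} \left(t_n - \mathbf{w}^T \phi(\mathbf{x}_n)\right)^2\right]\right) \tau^{\frac{N}{2} + c_0 - 1}
    \propto \exp\left(-\tau \left[d_0 + \frac{1}{2} \sum_{n=1}^{N} t_n^2 - \mathbf{w}^T \sum_{n=1}^{N} t_n \phi(\mathbf{x}_n) + \frac{1}{2} \sum_{n=1}^{N} \phi(\mathbf{x}_n)^T \mathbf{w}\mathbf{w}^T \phi(\mathbf{x}_n)\right]\right) \tau^{\frac{N}{2} + c_0 - 1} (B.23)

Therefore, $\tau$ is Gamma-distributed:

p(\tau \mid -) = \mathrm{Gamma}(\tau \mid c, d) (B.24)

where

c = c_0 + \frac{N}{2}, (B.25)

d = d_0 + \frac{1}{2} \sum_{n=1}^{N} t_n^2 - \mathbf{w}^T \sum_{n=1}^{N} t_n \phi(\mathbf{x}_n) + \frac{1}{2} \sum_{n=1}^{N} \phi(\mathbf{x}_n)^T \mathbf{w}\mathbf{w}^T \phi(\mathbf{x}_n). (B.26)

In the VB update, $\mathbf{w}$ and $\mathbf{w}\mathbf{w}^T$ are replaced by the moments $\langle \mathbf{w} \rangle$ and $\langle \mathbf{w}\mathbf{w}^T \rangle$
given in (B.12) and (B.13).

Useful moments in VB updates for other model parameters:

\langle \tau \rangle = \frac{c}{d} (B.27)

\langle \log \tau \rangle = \psi(c) - \log d, where \psi(\phi) = \frac{d}{d\phi} \log\Gamma(\phi) (B.28)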
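
To make the cycle of updates in Sections B.1.3 through B.1.5 concrete, the following
Python sketch iterates (B.10)-(B.11), (B.18)-(B.20), and (B.25)-(B.27) on a small
synthetic problem. It is an illustrative sketch only, not the implementation used in
this work: the function names, the initialization of the expectations at one, the
fixed iteration count, and the default hyperparameter values of 1e-6 are all
assumptions, and a small pure-Python matrix inverse is used to keep the example
self-contained.

```python
import random


def invert(M):
    """Gauss-Jordan inverse of a small square matrix given as a list of rows."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]


def vb_rvm_regression(Phi, t, iters=50, a0=1e-6, b0=1e-6, c0=1e-6, d0=1e-6):
    """Mean-field VB for RVM regression; Phi is N x D (rows are phi(x_n))."""
    N, D = len(Phi), len(Phi[0])
    # Sufficient statistics: sum_n phi phi^T, sum_n t_n phi, sum_n t_n^2.
    PtP = [[sum(Phi[n][i] * Phi[n][j] for n in range(N)) for j in range(D)]
           for i in range(D)]
    Ptt = [sum(Phi[n][i] * t[n] for n in range(N)) for i in range(D)]
    tt = sum(v * v for v in t)
    E_alpha, E_tau = [1.0] * D, 1.0   # assumed initial expectations
    for _ in range(iters):
        # q(w): covariance (B.11) and mean (B.10) with <tau>, <A> plugged in.
        S = invert([[E_tau * PtP[i][j] + (E_alpha[i] if i == j else 0.0)
                     for j in range(D)] for i in range(D)])
        m = [E_tau * sum(S[i][j] * Ptt[j] for j in range(D)) for i in range(D)]
        # q(alpha): (B.18)-(B.20), using <w_d^2> = m_d^2 + S_dd.
        E_alpha = [(a0 + 0.5) / (b0 + 0.5 * (m[d] ** 2 + S[d][d]))
                   for d in range(D)]
        # q(tau): (B.25)-(B.27), using <w> = m and <w w^T> = m m^T + S.
        c = c0 + 0.5 * N
        quad = sum((m[i] * m[j] + S[i][j]) * PtP[i][j]
                   for i in range(D) for j in range(D))
        d = d0 + 0.5 * tt - sum(m[i] * Ptt[i] for i in range(D)) + 0.5 * quad
        E_tau = c / d
    return m, S, E_alpha, E_tau


# Synthetic example: t = 1 + 2x + noise, with phi(x) = [1, x].
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(60)]
ts = [1.0 + 2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
m, S, E_alpha, E_tau = vb_rvm_regression([[1.0, x] for x in xs], ts)
print("posterior mean weights:", m)
print("expected noise precision:", E_tau)
```

In practice the NFE of Section B.1.6 would be monitored to decide when to stop
iterating, rather than running a fixed number of iterations as assumed here.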

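B.1.6 Negative Free Energy

The negative free energy (NFE) serves as the variational lower bound to the true
log-evidence. Therefore, it serves as an optimization criterion for variational learning.
The NFE can be expressed as the difference between the expected log-likelihood and
the Kullback-Leibler divergence (KLD) between the variational posteriors and the
priors:

\mathcal{F} = \langle \log p(\mathbf{t} \mid \mathbf{w}, \mathbf{X}) \rangle - \mathrm{KLD}[q(\mathbf{w}) \, q(\mathbf{A}) \, q(\tau) \,\|\, p(\mathbf{w} \mid \mathbf{A}) \, p(\mathbf{A}) \, p(\tau)]
    = \langle \log p(\mathbf{t} \mid \mathbf{w}, \mathbf{X}) \rangle - \mathrm{KLD}[q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A})]
      - \sum_{d=1}^{D} \mathrm{KLD}[q(\alpha_d) \,\|\, p(\alpha_d)] - \mathrm{KLD}[q(\tau) \,\|\, p(\tau)]
    = -\frac{N}{2} \log 2\pi + \frac{N}{2} \langle \log \tau \rangle - \frac{\langle \tau \rangle}{2} \sum_{n=1}^{N} \left[t_n - \langle \mathbf{w} \rangle^T \phi(\mathbf{x}_n)\right]^2
      - \mathrm{KLD}[q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A})] - \sum_{d=1}^{D} \mathrm{KLD}[q(\alpha_d) \,\|\, p(\alpha_d)] - \mathrm{KLD}[q(\tau) \,\|\, p(\tau)] (B.29)

where $\mathrm{KLD}[q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A})]$ is a KLD between two Gaussian distributions,
$\mathrm{KLD}[q(\alpha_d) \,\|\, p(\alpha_d)]$ is a KLD between two Gamma distributions, and $\mathrm{KLD}[q(\tau) \,\|\, p(\tau)]$
is also a KLD between two Gamma distributions.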
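
The Gamma KLD terms in (B.29) can be evaluated from (A.41). The sketch below is a
minimal illustration under two assumptions: the digamma function is approximated by
a central difference of `math.lgamma`, following its definition in (A.42), and the
KLD is written in a rearranged form that is algebraically equivalent to (A.41). The
function names are hypothetical.

```python
import math


def digamma(x, h=1e-5):
    """Central-difference approximation of psi(x) = d/dx log Gamma(x), per (A.42)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)


def kld_gamma(a_q, b_q, a_p, b_p):
    """KLD between Gamma(a_q, b_q) and Gamma(a_p, b_p), with b the rate in (A.38).

    Collecting terms in (A.41) gives
    (a_q - a_p) psi(a_q) - log Gamma(a_q) + log Gamma(a_p)
    + a_p (log b_q - log b_p) + a_q (b_p - b_q) / b_q.
    """
    return ((a_q - a_p) * digamma(a_q)
            - math.lgamma(a_q) + math.lgamma(a_p)
            + a_p * (math.log(b_q) - math.log(b_p))
            + a_q * (b_p - b_q) / b_q)


# Example: divergence of a posterior Gamma(c, d) on tau from a vague prior.
print(kld_gamma(30.0, 0.3, 1e-6, 1e-6))
```

For production use, a library digamma routine would be preferable to the
finite-difference approximation assumed here.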

B.2 RVM Classification

B.2.1 Generative Model and Variable Definitions

y_n = \mathbf{w}^T \phi(\mathbf{x}_n) (B.30)

\sigma(y_n) = \frac{1}{1 + e^{-y_n}} (B.31)

(t_n \mid \mathbf{w}, \mathbf{x}_n) \sim \sigma(y_n)^{t_n} \left[1 - \sigma(y_n)\right]^{1 - t_n} (B.32)

$n = 1, 2, \ldots, N$ is the observation index

$d = 1, 2, \ldots, D$ is the dimension index

$\mathbf{x}_n$ is the feature vector of observation $n$

$\phi(\mathbf{x}_n)$ is a D-dimensional kernel transformation of $\mathbf{x}_n$

$\sigma(\cdot)$ is the logistic sigmoid function

$\mathbf{w}$ is a D-dimensional weight vector

$y_n$ is the model output for observation $n$

$t_n$ is the binary label for observation $n$

B.2.2 Priors

\mathbf{w} \sim \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \mathbf{A}^{-1}), where \mathbf{A} = \mathrm{diag}(\boldsymbol{\alpha}) (B.33)

\alpha_d \sim \mathrm{Gamma}(\alpha_d \mid a_0, b_0), typically a_0 = b_0 = 10^{-6} (B.34)


B.2.3  Approximate  Likelihood 

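Because the binomial distribution on t does not offer conjugate updating for our choice of the prior on w, we impose a lower-bound approximation to the likelihood, $p(t_n \mid \mathbf{w}, \mathbf{x}_n)$, that was proved by Jaakkola and Jordan [97]:

\[ p(t_n \mid \mathbf{w}, \mathbf{x}_n) \ge \sigma(\xi_n) \exp\left\{ \frac{\gamma_n - \xi_n}{2} - \lambda(\xi_n) \left( \gamma_n^2 - \xi_n^2 \right) \right\} \tag{B.35} \]

where $\xi_n$ is a variational parameter and

\[ \gamma_n = (2 t_n - 1)\, y_n \tag{B.36} \]

\[ \lambda(\xi_n) = \frac{1}{4 \xi_n} \tanh\left( \frac{\xi_n}{2} \right) \tag{B.37} \]

Therefore, the log-likelihood will be approximated as

\[ \log p(t_n \mid \mathbf{w}, \mathbf{x}_n) \ge \log \sigma(\xi_n) + \frac{1}{2} (\gamma_n - \xi_n) - \lambda(\xi_n) \left( \gamma_n^2 - \xi_n^2 \right) \tag{B.38} \]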
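The bound in (B.38) can be checked numerically. The sketch below is illustrative only and is not taken from the dissertation's code; the function names `sigmoid`, `lam`, and `log_bound` are assumptions introduced here. It uses the fact that the Bernoulli likelihood in (B.32) equals $\sigma(\gamma_n)$, and confirms on a small grid that the bound never exceeds $\log \sigma(\gamma_n)$ and is tight at $\gamma_n = \xi_n$.

```python
import math

def sigmoid(y):
    """Logistic sigmoid, as in (B.31)."""
    return 1.0 / (1.0 + math.exp(-y))

def lam(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi), as in (B.37); the limit at xi = 0 is 1/8."""
    if abs(xi) < 1e-8:
        return 0.125
    return math.tanh(xi / 2.0) / (4.0 * xi)

def log_bound(gamma, xi):
    """Right-hand side of (B.38) for one observation."""
    return math.log(sigmoid(xi)) + 0.5 * (gamma - xi) - lam(xi) * (gamma ** 2 - xi ** 2)

# Hypothetical test grid: gamma = (2t - 1) y, xi = variational parameter.
gammas = [-3.0, -0.5, 0.0, 1.0, 4.0]
xis = [0.1, 1.0, 2.5]

# Largest amount by which the bound exceeds the exact log-likelihood (should be <= 0).
max_violation = max(
    log_bound(g, x) - math.log(sigmoid(g)) for g in gammas for x in xis
)

# The bound touches the exact value when gamma equals xi.
gap_at_tight = abs(log_bound(2.5, 2.5) - math.log(sigmoid(2.5)))
```

The gap closes at $\gamma_n^2 = \xi_n^2$, which is why the update for $\xi_n$ in this appendix sets $\xi_n^2$ to the expected value of $\gamma_n^2$.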

B.2.4  Variational Posterior on w


It  was  derived  in  Section  that  the  NFE  is  maximized  by  a  variational  posterior, 
q( w),  that  is  proportional  to  the  variational  expectation  of  the  true  log-posterior 
with  respect  to  all  other  model  parameters: 

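\[ q(\mathbf{w}) \propto \langle \log p(\mathbf{w} \mid -) \rangle \tag{B.39} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\mathbf{w} \mid -) = \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) - K, \tag{B.40} \]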

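where K denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of w, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\mathbf{w} \mid -) &= \sum_{n=1}^{N} \left[ \log \sigma(\xi_n) + \frac{1}{2} (\gamma_n - \xi_n) - \lambda(\xi_n) \left( \gamma_n^2 - \xi_n^2 \right) \right] - \frac{1}{2} \left[ D \log 2\pi + \log \left| \mathbf{A}^{-1} \right| + \mathbf{w}^T \mathbf{A} \mathbf{w} \right] - K \\
&= -\frac{1}{2} \left[ \mathbf{w}^T \mathbf{A} \mathbf{w} + \sum_{n=1}^{N} \left( 2 \lambda(\xi_n) \gamma_n^2 - \gamma_n \right) \right] - K \\
&= -\frac{1}{2} \left[ \mathbf{w}^T \mathbf{A} \mathbf{w} + \sum_{n=1}^{N} \left( 2 \lambda(\xi_n)\, \mathbf{w}^T \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \mathbf{w} - (2 t_n - 1)\, \mathbf{w}^T \phi(\mathbf{x}_n) \right) \right] - K \\
&= -\frac{1}{2} \left[ -2 \mathbf{w}^T \left( \frac{1}{2} \sum_{n=1}^{N} (2 t_n - 1) \phi(\mathbf{x}_n) \right) + \mathbf{w}^T \left( \mathbf{A} + 2 \sum_{n=1}^{N} \lambda(\xi_n) \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \right) \mathbf{w} \right] - K
\end{aligned}
\tag{B.41}
\]

Completing the square reveals that w is Gaussian:

\[ \log p(\mathbf{w} \mid -) = \log \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma}) \tag{B.42} \]

where

\[ \mathbf{m} = \frac{1}{2} \boldsymbol{\Sigma} \left( \sum_{n=1}^{N} (2 t_n - 1) \phi(\mathbf{x}_n) \right) \tag{B.43} \]

\[ \boldsymbol{\Sigma} = \left[ \mathbf{A} + 2 \sum_{n=1}^{N} \lambda(\xi_n) \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \right]^{-1} \tag{B.44} \]

Useful moments in VB updates for other model parameters:

\[ \langle \mathbf{w} \rangle = \mathbf{m} \tag{B.45} \]

\[ \langle \mathbf{w} \mathbf{w}^T \rangle = \mathbf{m} \mathbf{m}^T + \boldsymbol{\Sigma} \tag{B.46} \]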
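The Gaussian update for w can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the function name `update_q_w`, the design-matrix layout (one row $\phi(\mathbf{x}_n)^T$ per observation), and the use of the posterior mean of α on the diagonal of A are all choices made here for the example.

```python
import numpy as np

def lam(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi), with the limiting value 1/8 at xi = 0."""
    xi = np.asarray(xi, dtype=float)
    near_zero = np.abs(xi) < 1e-8
    safe = np.where(near_zero, 1.0, xi)  # avoid division by zero
    return np.where(near_zero, 0.125, np.tanh(safe / 2.0) / (4.0 * safe))

def update_q_w(Phi, t, alpha_mean, xi):
    """Return the mean m and covariance Sigma of q(w).

    Phi        : (N, D) matrix whose rows are phi(x_n)^T   (assumed layout)
    t          : (N,) binary labels in {0, 1}
    alpha_mean : (D,) expected precisions <alpha_d>         (assumed input)
    xi         : (N,) variational parameters
    """
    A = np.diag(alpha_mean)
    # A + 2 * sum_n lambda(xi_n) phi_n phi_n^T
    precision = A + 2.0 * (Phi.T * lam(xi)) @ Phi
    Sigma = np.linalg.inv(precision)
    # m = 0.5 * Sigma * sum_n (2 t_n - 1) phi_n
    m = 0.5 * Sigma @ (Phi.T @ (2.0 * t - 1.0))
    return m, Sigma

# Toy data, purely illustrative.
Phi = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
t = np.array([1.0, 0.0, 1.0])
m, Sigma = update_q_w(Phi, t, alpha_mean=np.ones(2), xi=np.ones(3))
```

In practice a Cholesky solve would usually be preferred over an explicit inverse; the inverse is kept here so that the code mirrors the formulas line by line.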


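B.2.5  Updating ξ

Since the variational parameter ξ is assumed known (i.e., it has no prior or posterior density), the updates cannot be found starting with Bayes' theorem. Update equations can still be derived by directly optimizing the NFE:

\[
\begin{aligned}
\mathcal{F} &= \langle \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) \rangle - \mathrm{KLD}\left[ q(\mathbf{w}) \,\|\, p(\mathbf{w}) \right] - \mathrm{KLD}\left[ q(\mathbf{A}) \,\|\, p(\mathbf{A}) \right] \\
\frac{\partial \mathcal{F}}{\partial \xi} &= \frac{\partial}{\partial \xi} \langle \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) \rangle
\end{aligned}
\tag{B.47}
\]

Substituting the approximation for $p(t_n \mid \mathbf{w}, \mathbf{x}_n)$:

\[
\begin{aligned}
\frac{\partial \mathcal{F}}{\partial \xi} &= \sum_{n=1}^{N} \left[ \frac{e^{-\xi_n}}{1 + e^{-\xi_n}} - \frac{1}{2} + 2 \xi_n \lambda(\xi_n) - \frac{\partial \lambda(\xi_n)}{\partial \xi_n} \left( \langle \gamma_n^2 \rangle - \xi_n^2 \right) \right] \\
&= \sum_{n=1}^{N} \left[ \frac{e^{-\xi_n/2}}{e^{\xi_n/2} + e^{-\xi_n/2}} + \frac{\frac{1}{2} e^{\xi_n/2} - \frac{1}{2} e^{-\xi_n/2}}{e^{\xi_n/2} + e^{-\xi_n/2}} - \frac{1}{2} - \frac{\partial \lambda(\xi_n)}{\partial \xi_n} \left( \langle \gamma_n^2 \rangle - \xi_n^2 \right) \right] \\
&= \sum_{n=1}^{N} \left[ \frac{\frac{1}{2} e^{\xi_n/2} + \frac{1}{2} e^{-\xi_n/2}}{e^{\xi_n/2} + e^{-\xi_n/2}} - \frac{1}{2} - \frac{\partial \lambda(\xi_n)}{\partial \xi_n} \left( \langle \gamma_n^2 \rangle - \xi_n^2 \right) \right] \\
&= -\sum_{n=1}^{N} \frac{\partial \lambda(\xi_n)}{\partial \xi_n} \left( \langle \gamma_n^2 \rangle - \xi_n^2 \right)
\end{aligned}
\tag{B.48}
\]

Because the derivative of $\lambda(\xi_n)$ is strictly negative for $\xi_n > 0$, $\mathcal{F}$ is maximized at

\[ \xi_n^2 = \langle \gamma_n^2 \rangle = \phi(\mathbf{x}_n)^T \langle \mathbf{w} \mathbf{w}^T \rangle \phi(\mathbf{x}_n) \tag{B.49} \]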
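The closed-form update for ξ is a quadratic form in the second moment of w. The sketch below assumes the same row-wise design-matrix layout used for illustration elsewhere (rows $\phi(\mathbf{x}_n)^T$) and takes the positive root; the function name `update_xi` is hypothetical.

```python
import numpy as np

def update_xi(Phi, m, Sigma):
    """Return xi_n = sqrt(phi_n^T <w w^T> phi_n) for every observation.

    Phi   : (N, D) matrix whose rows are phi(x_n)^T   (assumed layout)
    m     : (D,) mean of q(w)
    Sigma : (D, D) covariance of q(w)
    """
    second_moment = np.outer(m, m) + Sigma  # <w w^T> = m m^T + Sigma
    # Row-wise quadratic form phi_n^T <w w^T> phi_n.
    xi_sq = np.einsum("nd,de,ne->n", Phi, second_moment, Phi)
    return np.sqrt(xi_sq)

# Toy example, purely illustrative.
Phi = np.array([[1.0, 0.0], [0.0, 2.0]])
xi = update_xi(Phi, m=np.array([1.0, 1.0]), Sigma=np.eye(2))
```

Only the positive root is kept because the bound depends on ξ through $\xi_n^2$ and the even function $\lambda(\xi_n)$.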

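B.2.6  Variational Posterior on α

It was derived in Section that the NFE is maximized by a variational posterior, $q(\boldsymbol{\alpha})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\boldsymbol{\alpha}) \propto \langle \log p(\boldsymbol{\alpha} \mid -) \rangle \tag{B.50} \]

The true posterior may be calculated from Bayes' theorem:

\[ p(\boldsymbol{\alpha} \mid -) \propto p(\mathbf{w} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha}) \tag{B.51} \]

The variational posterior can be calculated by solving the true posterior as a function of α, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
p(\boldsymbol{\alpha} \mid -) &\propto \prod_{d=1}^{D} \alpha_d^{1/2} \exp\left( -\frac{\alpha_d w_d^2}{2} \right) \alpha_d^{a_0 - 1} \exp\left( -b_0 \alpha_d \right) \\
&\propto \prod_{d=1}^{D} \alpha_d^{a_0 + \frac{1}{2} - 1} \exp\left( -\alpha_d \left[ b_0 + \frac{w_d^2}{2} \right] \right)
\end{aligned}
\tag{B.52}
\]

Therefore, the α's are Gamma distributed:

\[ q(\boldsymbol{\alpha}) = \prod_{d=1}^{D} \mathrm{Gamma}(\alpha_d \mid a_d, b_d), \tag{B.53} \]

where

\[ a_d = a_0 + \frac{1}{2} \tag{B.54} \]

\[ b_d = b_0 + \frac{1}{2} \langle w_d^2 \rangle \tag{B.55} \]

Useful moments in VB updates for other model parameters:

\[ \langle \alpha_d \rangle = \frac{a_d}{b_d} \tag{B.56} \]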
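The Gamma update for the precisions is elementwise. The sketch below is illustrative and assumes that the expectation $\langle w_d^2 \rangle$ is obtained from q(w) as $m_d^2 + \Sigma_{dd}$ (the diagonal of the second moment in (B.46)); the function name `update_q_alpha` is hypothetical.

```python
import numpy as np

def update_q_alpha(m, Sigma, a0=1e-6, b0=1e-6):
    """Return the Gamma parameters (a_d, b_d) of q(alpha) and the means <alpha_d>.

    m     : (D,) mean of q(w)
    Sigma : (D, D) covariance of q(w)
    """
    m = np.asarray(m, dtype=float)
    w_sq = m ** 2 + np.diag(Sigma)     # <w_d^2>, assumed to come from (B.46)
    a = np.full(m.shape, a0 + 0.5)     # a_d = a_0 + 1/2
    b = b0 + 0.5 * w_sq                # b_d = b_0 + <w_d^2> / 2
    return a, b, a / b

# Toy example, purely illustrative.
a, b, alpha_mean = update_q_alpha(np.array([0.0, 2.0]), np.diag([1.0, 0.5]))
```

A weight whose second moment shrinks toward zero receives a very large expected precision, which is the mechanism by which the RVM prunes basis functions.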

B.2.7  Negative  Free  Energy 

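The NFE can be expressed as the difference between the expected log-likelihood and the Kullback-Leibler divergence (KLD) between the variational posteriors and the priors:

\[
\begin{aligned}
\mathcal{F} &= \langle \log p(\mathbf{t} \mid \mathbf{w}, \mathbf{X}) \rangle - \mathrm{KLD}\left[ q(\mathbf{w})\, q(\mathbf{A}) \,\|\, p(\mathbf{w} \mid \mathbf{A})\, p(\mathbf{A}) \right] \\
&= \langle \log p(\mathbf{t} \mid \mathbf{w}, \mathbf{X}) \rangle - \mathrm{KLD}\left[ q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A}) \right] - \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_d) \,\|\, p(\alpha_d) \right] \\
&= \sum_{n=1}^{N} \left[ \log \sigma(\xi_n) + \frac{1}{2} \left( \langle \gamma_n \rangle - \xi_n \right) - \lambda(\xi_n) \left( \langle \gamma_n^2 \rangle - \xi_n^2 \right) \right] \\
&\quad - \mathrm{KLD}\left[ q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A}) \right] - \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_d) \,\|\, p(\alpha_d) \right] \\
&= \sum_{n=1}^{N} \left[ \log \sigma(\xi_n) + \frac{1}{2} \left( (2 t_n - 1) \phi(\mathbf{x}_n)^T \langle \mathbf{w} \rangle - \xi_n \right) - \lambda(\xi_n) \left( \phi(\mathbf{x}_n)^T \langle \mathbf{w} \mathbf{w}^T \rangle \phi(\mathbf{x}_n) - \xi_n^2 \right) \right] \\
&\quad - \mathrm{KLD}\left[ q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathbf{A}) \right] - \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_d) \,\|\, p(\alpha_d) \right]
\end{aligned}
\tag{B.57}
\]

where $\mathrm{KLD}[q(\mathbf{w}) \| p(\mathbf{w} \mid \mathbf{A})]$ is a KLD between two Gaussian distributions, and $\mathrm{KLD}[q(\alpha_d) \| p(\alpha_d)]$ is a KLD between two Gamma distributions.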

B.3  Mixture  of  RVM  Classifiers 

B.3.1  Generative  Model  and  Variable  Definitions 


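\[ y_{nm} = \mathbf{w}_m^T \phi(\mathbf{x}_n) \tag{B.58} \]

\[ \sigma(y_{nm}) = \frac{1}{1 + e^{-y_{nm}}} \tag{B.59} \]

\[ (t_n \mid \mathbf{W}, \mathbf{x}_n, \mathbf{c}) \sim \left( \sigma(y_{nm})^{t_n} \left[ 1 - \sigma(y_{nm}) \right]^{1 - t_n} \right)^{c_{nm}} \tag{B.60} \]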

n = 1, 2, ..., N is the observation index

d = 1, 2, ..., D is the dimension index

m = 1, 2, ..., M is the mixture component index

$\mathbf{x}_n$ is the feature vector of observation n

$\phi(\mathbf{x}_n)$ is a D-dimensional kernel transformation of $\mathbf{x}_n$

$\sigma(\cdot)$ is the logistic sigmoid function

$\mathbf{w}_m$ is a D-dimensional weight vector

$y_{nm}$ is the output of model m for observation n

$t_n$ is the binary label for observation n

$\mathbf{c}_n = \{c_{nm}\}$ is a latent variable governing mixture component selection

$\tau$ is the precision of t

B.3.2  Priors 

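\[ \mathbf{w}_m \sim \mathcal{N}(\mathbf{w}_m \mid \mathbf{0}, \mathbf{A}_m^{-1}), \quad \text{where } \mathbf{A}_m = \mathrm{diag}(\boldsymbol{\alpha}_m) \tag{B.61} \]

\[ \alpha_{md} \sim \mathrm{Gamma}(\alpha_{md} \mid a_0, b_0), \quad \text{typically } a_0 = b_0 = 10^{-6} \tag{B.62} \]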


B.3.3  Variational  Posterior  on  w 

It  was  derived  in  Section  that  the  NFE  is  maximized  by  a  variational  posterior, 
q( W),  that  is  proportional  to  the  variational  expectation  of  the  true  log-posterior 
with  respect  to  all  other  model  parameters: 

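\[ q(\mathbf{W}) \propto \langle \log p(\mathbf{W} \mid -) \rangle \tag{B.63} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\mathbf{W} \mid -) = \log p(\mathbf{t} \mid \mathbf{W}, \mathbf{X}, -) + \log p(\mathbf{W}) - K, \tag{B.64} \]

where K denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of W, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\mathbf{W} \mid -) &= \sum_{n=1}^{N} \sum_{m=1}^{M} c_{nm} \left[ \log \sigma(\xi_{nm}) + \frac{1}{2} (\gamma_{nm} - \xi_{nm}) - \lambda(\xi_{nm}) \left( \gamma_{nm}^2 - \xi_{nm}^2 \right) \right] \\
&\quad - \frac{1}{2} \sum_{m=1}^{M} \left[ D \log 2\pi + \log \left| \mathbf{A}_m^{-1} \right| + \mathbf{w}_m^T \mathbf{A}_m \mathbf{w}_m \right] - K \\
&= -\frac{1}{2} \sum_{m=1}^{M} \left[ \mathbf{w}_m^T \mathbf{A}_m \mathbf{w}_m + \sum_{n=1}^{N} c_{nm} \left( 2 \lambda(\xi_{nm}) \gamma_{nm}^2 - \gamma_{nm} \right) \right] - K \\
&= -\frac{1}{2} \sum_{m=1}^{M} \left[ \mathbf{w}_m^T \mathbf{A}_m \mathbf{w}_m + \sum_{n=1}^{N} c_{nm} \left( 2 \lambda(\xi_{nm})\, \mathbf{w}_m^T \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \mathbf{w}_m - (2 t_n - 1)\, \mathbf{w}_m^T \phi(\mathbf{x}_n) \right) \right] - K \\
&= -\frac{1}{2} \sum_{m=1}^{M} \left[ -2 \mathbf{w}_m^T \left( \frac{1}{2} \sum_{n=1}^{N} c_{nm} (2 t_n - 1) \phi(\mathbf{x}_n) \right) + \mathbf{w}_m^T \left( \mathbf{A}_m + 2 \sum_{n=1}^{N} c_{nm} \lambda(\xi_{nm}) \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \right) \mathbf{w}_m \right] - K
\end{aligned}
\tag{B.65}
\]

Completing the square reveals that W is Gaussian:

\[ \log p(\mathbf{W} \mid -) = \sum_{m=1}^{M} \log \mathcal{N}(\mathbf{w}_m \mid \mathbf{m}_m, \boldsymbol{\Sigma}_m) \tag{B.66} \]

where

\[ \mathbf{m}_m = \frac{1}{2} \boldsymbol{\Sigma}_m \left( \sum_{n=1}^{N} c_{nm} (2 t_n - 1) \phi(\mathbf{x}_n) \right) \tag{B.67} \]

\[ \boldsymbol{\Sigma}_m = \left[ \mathbf{A}_m + 2 \sum_{n=1}^{N} c_{nm} \lambda(\xi_{nm}) \phi(\mathbf{x}_n) \phi(\mathbf{x}_n)^T \right]^{-1} \tag{B.68} \]

Useful moments in VB updates for other model parameters:

\[ \langle \mathbf{w}_m \rangle = \mathbf{m}_m \tag{B.69} \]

\[ \langle \mathbf{w}_m \mathbf{w}_m^T \rangle = \mathbf{m}_m \mathbf{m}_m^T + \boldsymbol{\Sigma}_m \tag{B.70} \]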


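B.3.4  Updating ξ

Since the variational parameter ξ is assumed known (i.e., it has no prior or posterior density), the updates are found by directly optimizing the negative free energy:

\[
\begin{aligned}
\mathcal{F} &= \langle \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{W}) \rangle - \mathrm{KLD}\left[ q(\mathbf{W}) \,\|\, p(\mathbf{W}) \right] - \mathrm{KLD}\left[ q(\mathbf{A}) \,\|\, p(\mathbf{A}) \right] \\
\frac{\partial \mathcal{F}}{\partial \xi} &= \frac{\partial}{\partial \xi} \langle \log p(\mathbf{t} \mid \mathbf{X}, \mathbf{W}) \rangle \\
\frac{\partial \mathcal{F}}{\partial \xi} &= \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\partial}{\partial \xi_{nm}} \langle \log p(t_n \mid \mathbf{x}_n, \mathbf{w}_m) \rangle
\end{aligned}
\tag{B.71}
\]

Substituting the approximation for $p(t_n \mid \mathbf{W}, \mathbf{x}_n)$:

\[
\begin{aligned}
\frac{\partial \mathcal{F}}{\partial \xi} &= \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ \frac{e^{-\xi_{nm}}}{1 + e^{-\xi_{nm}}} - \frac{1}{2} + 2 \xi_{nm} \lambda(\xi_{nm}) - \frac{\partial \lambda(\xi_{nm})}{\partial \xi_{nm}} \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right) \right] \\
&= \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ \frac{e^{-\xi_{nm}/2}}{e^{\xi_{nm}/2} + e^{-\xi_{nm}/2}} + \frac{\frac{1}{2} e^{\xi_{nm}/2} - \frac{1}{2} e^{-\xi_{nm}/2}}{e^{\xi_{nm}/2} + e^{-\xi_{nm}/2}} - \frac{1}{2} - \frac{\partial \lambda(\xi_{nm})}{\partial \xi_{nm}} \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right) \right] \\
&= \sum_{n=1}^{N} \sum_{m=1}^{M} \left[ \frac{\frac{1}{2} e^{\xi_{nm}/2} + \frac{1}{2} e^{-\xi_{nm}/2}}{e^{\xi_{nm}/2} + e^{-\xi_{nm}/2}} - \frac{1}{2} - \frac{\partial \lambda(\xi_{nm})}{\partial \xi_{nm}} \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right) \right] \\
&= -\sum_{n=1}^{N} \sum_{m=1}^{M} \frac{\partial \lambda(\xi_{nm})}{\partial \xi_{nm}} \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right)
\end{aligned}
\tag{B.72}
\]

Because the derivative of $\lambda(\xi_{nm})$ is strictly negative for $\xi_{nm} > 0$, $\mathcal{F}$ is maximized at

\[ \xi_{nm}^2 = \langle \gamma_{nm}^2 \rangle = \phi(\mathbf{x}_n)^T \langle \mathbf{w}_m \mathbf{w}_m^T \rangle \phi(\mathbf{x}_n) \tag{B.73} \]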

B.3.5  Variational Posterior on α

It was derived in Section that the NFE is maximized by a variational posterior, $q(\boldsymbol{\alpha})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\boldsymbol{\alpha}) \propto \langle \log p(\boldsymbol{\alpha} \mid -) \rangle \tag{B.74} \]

The true posterior may be calculated from Bayes' theorem:

\[ p(\boldsymbol{\alpha} \mid -) \propto p(\mathbf{W} \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha}) \tag{B.75} \]

The variational posterior can be calculated by solving the true posterior as a function of α, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
p(\boldsymbol{\alpha} \mid -) &\propto \prod_{m=1}^{M} \prod_{d=1}^{D} \alpha_{md}^{1/2} \exp\left( -\frac{\alpha_{md} w_{md}^2}{2} \right) \alpha_{md}^{a_0 - 1} \exp\left( -b_0 \alpha_{md} \right) \\
&\propto \prod_{m=1}^{M} \prod_{d=1}^{D} \alpha_{md}^{a_0 + \frac{1}{2} - 1} \exp\left( -\alpha_{md} \left[ b_0 + \frac{w_{md}^2}{2} \right] \right)
\end{aligned}
\tag{B.76}
\]

Therefore, the α's are Gamma distributed:

\[ p(\boldsymbol{\alpha} \mid -) = \prod_{m=1}^{M} \prod_{d=1}^{D} \mathrm{Gamma}(\alpha_{md} \mid a_{md}, b_{md}) \tag{B.77} \]

where

\[ a_{md} = a_0 + \frac{1}{2} \tag{B.78} \]

\[ b_{md} = b_0 + \frac{1}{2} \langle w_{md}^2 \rangle \tag{B.79} \]

Useful moments in VB updates for other model parameters:

\[ \langle \alpha_{md} \rangle = \frac{a_{md}}{b_{md}} \tag{B.80} \]

\[ \langle \mathbf{A}_m \rangle = \mathrm{diag}(\langle \boldsymbol{\alpha}_m \rangle) \tag{B.81} \]

B.3.6  Treatment of c

The latent variable c defines which component of the RVM mixture is used. For a fully-conjugate model, $\mathbf{c}_n \sim \mathrm{Multinomial}(\mathbf{p}_n)$ and $\mathbf{p}_n \sim \mathrm{Dir}(\lambda_0)$. However, this approach is not used in this work and is therefore not discussed here.

In this work, the mixture of RVMs is used in a context-dependent learning framework. For cases in which supervised context modeling is used, the values $c_{nm}$ are determined by the known context labels. Therefore, $c_{nm} = 1$ for observations collected in the mth labeled context. If unsupervised context modeling is used, c can be treated as multinomially distributed, and its density is determined a posteriori from context identification. Therefore $\langle c_{nm} \rangle = p(c_{nm} \mid \mathbf{x}_n)$ regardless of the context model used to obtain these posterior probabilities. See Chapter 4 for more information. Discriminative context-dependent learning is a unique case that is described in Chapter 5, and the VB derivation can be found in Appendix E.
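For the supervised case described above, the indicators $c_{nm}$ are simply a one-hot encoding of the context labels. The sketch below is illustrative only; the function name `supervised_context_indicators` and the convention that labels are integers starting at 0 are assumptions made for this example. In the unsupervised case, the same N × M array would instead hold the posterior probabilities $\langle c_{nm} \rangle$ produced by a context model.

```python
def supervised_context_indicators(context_labels, num_contexts):
    """Build the N x M indicator array c_nm from integer context labels.

    context_labels : list of ints in {0, ..., M-1}, one per observation (assumed convention)
    num_contexts   : M, the number of labeled contexts
    """
    indicators = []
    for label in context_labels:
        row = [0.0] * num_contexts
        row[label] = 1.0  # c_nm = 1 for the observation's labeled context
        indicators.append(row)
    return indicators

# Three observations collected in contexts 0, 2, and 1.
c = supervised_context_indicators([0, 2, 1], num_contexts=3)
```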

B.3.7  Negative  Free  Energy 

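The NFE can be expressed as the difference between the expected log-likelihood and the Kullback-Leibler divergence (KLD) between the variational posteriors and the priors:

\[
\begin{aligned}
\mathcal{F} &= \langle \log p(\mathbf{t} \mid \mathbf{W}, \mathbf{X}) \rangle - \mathrm{KLD}\left[ q(\mathbf{W})\, q(\mathbf{A}) \,\|\, p(\mathbf{W} \mid \mathbf{A})\, p(\mathbf{A}) \right] \\
&= \langle \log p(\mathbf{t} \mid \mathbf{W}, \mathbf{X}) \rangle - \sum_{m=1}^{M} \mathrm{KLD}\left[ q(\mathbf{w}_m) \,\|\, p(\mathbf{w}_m \mid \mathbf{A}_m) \right] - \sum_{m=1}^{M} \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_{md}) \,\|\, p(\alpha_{md}) \right] \\
&= \sum_{n=1}^{N} \sum_{m=1}^{M} \langle c_{nm} \rangle \left[ \log \sigma(\xi_{nm}) + \frac{1}{2} \left( \langle \gamma_{nm} \rangle - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right) \right] \\
&\quad - \sum_{m=1}^{M} \mathrm{KLD}\left[ q(\mathbf{w}_m) \,\|\, p(\mathbf{w}_m \mid \mathbf{A}_m) \right] - \sum_{m=1}^{M} \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_{md}) \,\|\, p(\alpha_{md}) \right] \\
&= \sum_{n=1}^{N} \sum_{m=1}^{M} \langle c_{nm} \rangle \left[ \log \sigma(\xi_{nm}) + \frac{1}{2} \left( (2 t_n - 1) \phi(\mathbf{x}_n)^T \langle \mathbf{w}_m \rangle - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( \phi(\mathbf{x}_n)^T \langle \mathbf{w}_m \mathbf{w}_m^T \rangle \phi(\mathbf{x}_n) - \xi_{nm}^2 \right) \right] \\
&\quad - \sum_{m=1}^{M} \mathrm{KLD}\left[ q(\mathbf{w}_m) \,\|\, p(\mathbf{w}_m \mid \mathbf{A}_m) \right] - \sum_{m=1}^{M} \sum_{d=1}^{D} \mathrm{KLD}\left[ q(\alpha_{md}) \,\|\, p(\alpha_{md}) \right]
\end{aligned}
\tag{B.82}
\]

where $\mathrm{KLD}[q(\mathbf{w}_m) \| p(\mathbf{w}_m \mid \mathbf{A}_m)]$ is a KLD between two Gaussian distributions, and $\mathrm{KLD}[q(\alpha_{md}) \| p(\alpha_{md})]$ is a KLD between two Gamma distributions.

Appendix C

Dirichlet Process Gaussian Mixture Models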


One  of  the  generative  context  models  presented  in  Chapter  4  is  the  Dirichlet  process 
Gaussian  mixture  model  (DPGMM).  The  DPGMM  can  be  useful  when  performing 
unsupervised  clustering  in  scenarios  where  the  number  of  clusters  is  uncertain.  This 
appendix  presents  the  DPGMM  of  Blei  and  Jordan  [67],  as  well  as  derivations  for  all 
variational  Bayesian  (VB)  update  equations  and  the  negative  free  energy  (NFE). 

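C.1  Generative Model and Variable Definitions

\[ (\mathbf{x} \mid c_{nm} = 1) \sim \mathcal{N}_D(\mathbf{x} \mid \boldsymbol{\mu}_m, \boldsymbol{\Lambda}_m^{-1}) \tag{C.1} \]

$\mathbf{x}$ is a D × 1 feature vector

$\mathbf{c}_n$ is an M × 1 binary-coded latent variable

n = 1, 2, ..., N is the data index

m = 1, 2, ..., T is the mixture component index (T is arbitrarily large)

d = 1, 2, ..., D is the dimension index

C.2  Priors

\[ (\boldsymbol{\mu}_m, \boldsymbol{\Lambda}_m) \sim \mathcal{N}_D\left( \boldsymbol{\mu}_m \mid \boldsymbol{\rho}_0, (u_0 \boldsymbol{\Lambda}_m)^{-1} \right) \mathcal{W}(\boldsymbol{\Lambda}_m \mid \mathbf{B}_0, \nu_0) \tag{C.2} \]

\[ \mathbf{c}_n \sim \mathrm{Multinomial}(\boldsymbol{\pi}) \tag{C.3} \]

\[ \pi_m = v_m \prod_{l < m} (1 - v_l) \tag{C.4} \]

\[ v_m \sim \mathrm{Beta}(1, \alpha) \tag{C.5} \]

\[ \alpha \sim \mathrm{Gamma}(\tau_{10}, \tau_{20}) \tag{C.6} \]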
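The stick-breaking construction in (C.4) maps the Beta-distributed fractions $v_m$ to mixture weights $\pi_m$. A minimal sketch follows; the function name `stick_breaking_weights` is an assumption of this example, and the final fraction is set to 1 to illustrate the usual truncation at T components.

```python
def stick_breaking_weights(v):
    """Convert stick-breaking fractions v_1..v_T into weights pi_1..pi_T.

    pi_m = v_m * prod_{l < m} (1 - v_l)
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for v_m in v:
        weights.append(v_m * remaining)
        remaining *= (1.0 - v_m)
    return weights

# With the last fraction fixed to 1 (truncation), the weights sum to one.
pi = stick_breaking_weights([0.5, 0.5, 1.0])
```

Smaller values of α push the fractions $v_m$ toward 1, so most of the stick is consumed by the first few components and fewer clusters receive appreciable weight.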


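C.3  Model Likelihood

The joint likelihood of data given all model parameters is given by

\[ p(\mathbf{X} \mid -) = \prod_{n=1}^{N} \prod_{m=1}^{T} \mathcal{N}_D(\mathbf{x}_n \mid \boldsymbol{\mu}_m, \boldsymbol{\Lambda}_m^{-1})^{c_{nm}} \tag{C.7} \]

\[ \log p(\mathbf{X} \mid -) = -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \left[ D \log 2\pi + \log \left| \boldsymbol{\Lambda}_m^{-1} \right| + (\mathbf{x}_n - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n - \boldsymbol{\mu}_m) \right] \tag{C.8} \]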

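C.4  Variational Posterior on μ and Λ

It was derived in Section E.10 that the NFE is maximized by a variational posterior, $q(\boldsymbol{\mu}, \boldsymbol{\Lambda})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\boldsymbol{\mu}, \boldsymbol{\Lambda}) \propto \langle \log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) \rangle \tag{C.9} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) = \log p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, -) + \log p(\boldsymbol{\mu} \mid \boldsymbol{\Lambda}) + \log p(\boldsymbol{\Lambda}) - K, \tag{C.10} \]

where K denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of μ and Λ, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) &= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \left[ D \log 2\pi + \log \left| \boldsymbol{\Lambda}_m^{-1} \right| + (\mathbf{x}_n - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n - \boldsymbol{\mu}_m) \right] \\
&\quad - \frac{1}{2} \sum_{m=1}^{T} \left[ D \log 2\pi + \log \left| \boldsymbol{\Lambda}_m^{-1} \right| + (\boldsymbol{\mu}_m - \boldsymbol{\rho}_0)^T u_0 \boldsymbol{\Lambda}_m (\boldsymbol{\mu}_m - \boldsymbol{\rho}_0) \right] \\
&\quad + \sum_{m=1}^{T} \left[ \frac{\nu_0 - D - 1}{2} \log \left| \boldsymbol{\Lambda}_m \right| - \frac{1}{2} \mathrm{Tr}\left( \mathbf{B}_0^{-1} \boldsymbol{\Lambda}_m \right) \right] - K \\
&= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \left[ \boldsymbol{\mu}_m^T \boldsymbol{\Lambda}_m \boldsymbol{\mu}_m - 2 \boldsymbol{\mu}_m^T \boldsymbol{\Lambda}_m \mathbf{x}_n + \mathbf{x}_n^T \boldsymbol{\Lambda}_m \mathbf{x}_n \right] \\
&\quad - \frac{1}{2} \sum_{m=1}^{T} \left[ \boldsymbol{\mu}_m^T u_0 \boldsymbol{\Lambda}_m \boldsymbol{\mu}_m - 2 \boldsymbol{\mu}_m^T u_0 \boldsymbol{\Lambda}_m \boldsymbol{\rho}_0 + \boldsymbol{\rho}_0^T u_0 \boldsymbol{\Lambda}_m \boldsymbol{\rho}_0 \right] \\
&\quad - \frac{1}{2} \sum_{m=1}^{T} \left[ \mathrm{Tr}\left( \mathbf{B}_0^{-1} \boldsymbol{\Lambda}_m \right) - \left( \nu_0 + \sum_{n=1}^{N} c_{nm} - D \right) \log \left| \boldsymbol{\Lambda}_m \right| \right] - K \\
&= -\frac{1}{2} \sum_{m=1}^{T} \Bigg[ -2 \boldsymbol{\mu}_m^T \boldsymbol{\Lambda}_m \left( u_0 \boldsymbol{\rho}_0 + \sum_{n=1}^{N} c_{nm} \mathbf{x}_n \right) + \boldsymbol{\mu}_m^T \left( \sum_{n=1}^{N} c_{nm} + u_0 \right) \boldsymbol{\Lambda}_m \boldsymbol{\mu}_m \\
&\quad\quad + \boldsymbol{\rho}_m^T \left( \sum_{n=1}^{N} c_{nm} + u_0 \right) \boldsymbol{\Lambda}_m \boldsymbol{\rho}_m - \boldsymbol{\rho}_m^T \left( \sum_{n=1}^{N} c_{nm} + u_0 \right) \boldsymbol{\Lambda}_m \boldsymbol{\rho}_m + \sum_{n=1}^{N} c_{nm} \mathbf{x}_n^T \boldsymbol{\Lambda}_m \mathbf{x}_n \\
&\quad\quad + \boldsymbol{\rho}_0^T u_0 \boldsymbol{\Lambda}_m \boldsymbol{\rho}_0 + \mathrm{Tr}\left( \mathbf{B}_0^{-1} \boldsymbol{\Lambda}_m \right) - \left( \nu_0 + \sum_{n=1}^{N} c_{nm} - D \right) \log \left| \boldsymbol{\Lambda}_m \right| \Bigg] - K
\end{aligned}
\tag{C.11}
\]

Using the identity $\mathbf{a}^T \mathbf{B} \mathbf{a} = \mathrm{Tr}(\mathbf{a} \mathbf{a}^T \mathbf{B})$ yields:

\[
\begin{aligned}
\log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) = -\frac{1}{2} \sum_{m=1}^{T} \Bigg[ &(\boldsymbol{\mu}_m - \boldsymbol{\rho}_m)^T u_m \boldsymbol{\Lambda}_m (\boldsymbol{\mu}_m - \boldsymbol{\rho}_m) - \mathrm{Tr}\left( u_m \mathbf{S}_m \boldsymbol{\Lambda}_m \right) + \mathrm{Tr}\left( \mathbf{C}_m \boldsymbol{\Lambda}_m \right) \\
&+ \mathrm{Tr}\left( u_0 \mathbf{S}_0 \boldsymbol{\Lambda}_m \right) + \mathrm{Tr}\left( \mathbf{B}_0^{-1} \boldsymbol{\Lambda}_m \right) - \left( \nu_0 + \sum_{n=1}^{N} c_{nm} - D \right) \log \left| \boldsymbol{\Lambda}_m \right| \Bigg] - K
\end{aligned}
\tag{C.12}
\]

where

\[ u_m = u_0 + \sum_{n=1}^{N} c_{nm} \tag{C.13} \]

\[ \boldsymbol{\rho}_m = \frac{u_0 \boldsymbol{\rho}_0 + \sum_{n=1}^{N} c_{nm} \mathbf{x}_n}{u_m} \tag{C.14} \]

\[ \mathbf{S}_m = \boldsymbol{\rho}_m \boldsymbol{\rho}_m^T \tag{C.15} \]

\[ \mathbf{C}_m = \sum_{n=1}^{N} c_{nm} \mathbf{x}_n \mathbf{x}_n^T \tag{C.16} \]

\[ \mathbf{S}_0 = \boldsymbol{\rho}_0 \boldsymbol{\rho}_0^T \tag{C.17} \]

Consolidating the four trace terms using the identity $\mathrm{Tr}(\mathbf{A}) + \mathrm{Tr}(\mathbf{B}) = \mathrm{Tr}(\mathbf{A} + \mathbf{B})$ yields:

\[
\begin{aligned}
\log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) &= -\frac{1}{2} \sum_{m=1}^{T} \Bigg[ (\boldsymbol{\mu}_m - \boldsymbol{\rho}_m)^T u_m \boldsymbol{\Lambda}_m (\boldsymbol{\mu}_m - \boldsymbol{\rho}_m) + \mathrm{Tr}\left( \mathbf{C}_m \boldsymbol{\Lambda}_m - u_m \mathbf{S}_m \boldsymbol{\Lambda}_m + u_0 \mathbf{S}_0 \boldsymbol{\Lambda}_m + \mathbf{B}_0^{-1} \boldsymbol{\Lambda}_m \right) \\
&\quad\quad - \left( \nu_0 + \sum_{n=1}^{N} c_{nm} - D \right) \log \left| \boldsymbol{\Lambda}_m \right| \Bigg] - K \\
&= -\frac{1}{2} \sum_{m=1}^{T} \Bigg[ (\boldsymbol{\mu}_m - \boldsymbol{\rho}_m)^T u_m \boldsymbol{\Lambda}_m (\boldsymbol{\mu}_m - \boldsymbol{\rho}_m) + \mathrm{Tr}\left[ \left( \mathbf{C}_m - u_m \mathbf{S}_m + u_0 \mathbf{S}_0 + \mathbf{B}_0^{-1} \right) \boldsymbol{\Lambda}_m \right] \\
&\quad\quad - \left( \nu_0 + \sum_{n=1}^{N} c_{nm} - D \right) \log \left| \boldsymbol{\Lambda}_m \right| \Bigg] - K
\end{aligned}
\tag{C.18}
\]

Consolidating terms reveals that μ, Λ are Normal-Wishart:

\[ \log p(\boldsymbol{\mu}, \boldsymbol{\Lambda} \mid -) = \sum_{m=1}^{T} \log \left[ \mathcal{N}\left( \boldsymbol{\mu}_m \mid \boldsymbol{\rho}_m, (u_m \boldsymbol{\Lambda}_m)^{-1} \right) \mathcal{W}(\boldsymbol{\Lambda}_m \mid \mathbf{B}_m, \nu_m) \right] \tag{C.19} \]

where $\boldsymbol{\rho}_m$ and $u_m$ are defined above, and

\[ \nu_m = \nu_0 + \sum_{n=1}^{N} c_{nm} \tag{C.20} \]

\[ \mathbf{B}_m = \left( \mathbf{C}_m - u_m \mathbf{S}_m + u_0 \mathbf{S}_0 + \mathbf{B}_0^{-1} \right)^{-1} \tag{C.21} \]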
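The Normal-Wishart update for one mixture component can be sketched with NumPy as follows. This is a hedged illustration rather than the author's code: the function name `update_normal_wishart` is hypothetical, the assignments are passed in as expected values (responsibilities) for one component, and $\mathbf{B}_m$ is taken to be the inverse of the consolidated matrix in the trace term, which is an interpretation of the garbled source.

```python
import numpy as np

def update_normal_wishart(X, c_m, rho0, u0, nu0, B0):
    """Posterior Normal-Wishart parameters for mixture component m.

    X    : (N, D) data matrix, one row per observation
    c_m  : (N,) assignments or responsibilities for component m
    rho0 : (D,) prior mean;  u0 : prior precision scale
    nu0  : prior degrees of freedom;  B0 : (D, D) prior Wishart scale matrix
    """
    n_m = c_m.sum()
    u_m = u0 + n_m                                  # (C.13)
    rho_m = (u0 * rho0 + X.T @ c_m) / u_m           # (C.14)
    S_m = np.outer(rho_m, rho_m)                    # (C.15)
    C_m = (X.T * c_m) @ X                           # (C.16): sum_n c_nm x_n x_n^T
    S_0 = np.outer(rho0, rho0)                      # (C.17)
    nu_m = nu0 + n_m                                # (C.20)
    # (C.21), assuming B_m is the inverse of the consolidated matrix.
    B_m = np.linalg.inv(C_m - u_m * S_m + u0 * S_0 + np.linalg.inv(B0))
    return rho_m, u_m, nu_m, B_m

# Toy example: two points fully assigned to the component.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
rho_m, u_m, nu_m, B_m = update_normal_wishart(
    X, c_m=np.array([1.0, 1.0]), rho0=np.zeros(2), u0=1.0, nu0=3.0, B0=np.eye(2)
)
```

In the toy example the inverse scale matrix equals the prior inverse scale plus the within-cluster scatter plus a shrinkage term toward the prior mean, which is the familiar conjugate result.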


Useful  moments  in  VB  updates  for  other  model  parameters: 


\[ \langle \boldsymbol{\mu}_m \rangle = \boldsymbol{\rho}_m \tag{C.22} \]

(C.23)


(M mMm )  =  T  Um  (C.24) 

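\[ \langle \log \left| \boldsymbol{\Lambda}_m \right| \rangle = \sum_{d=1}^{D} \psi\left( \frac{\nu_m + 1 - d}{2} \right) + D \log 2 + \log \left| \mathbf{B}_m \right| \tag{C.25} \]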

((xn  Mm)  (xn  —  Mm))  =  (X«  Mm)  (xn  —  Mm)  “I  (C.26) 


C.5  Variational Posterior on v

It was derived in Section that the NFE is maximized by a variational posterior, $q(\mathbf{v})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\mathbf{v}) \propto \langle \log p(v_m \mid -) \rangle \tag{C.27} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(v_m \mid -) = \log p(\mathbf{C} \mid v_m) + \log p(v_m) - K, \tag{C.28} \]

where K denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of v, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(v_m \mid -) &= \sum_{n=1}^{N} c_{nm} \log v_m + \sum_{n=1}^{N} \sum_{l > m} c_{nl} \log(1 - v_m) + (\alpha - 1) \log(1 - v_m) - K \\
&= \sum_{n=1}^{N} c_{nm} \log v_m + \left[ \alpha + \sum_{n=1}^{N} \sum_{l > m} c_{nl} - 1 \right] \log(1 - v_m) - K
\end{aligned}
\tag{C.29}
\]

Therefore,

\[ p(v_m \mid -) = \mathrm{Beta}(\gamma_{m1}, \gamma_{m2}) \tag{C.30} \]

where

\[ \gamma_{m1} = 1 + \sum_{n=1}^{N} c_{nm} \tag{C.31} \]

\[ \gamma_{m2} = \alpha + \sum_{n=1}^{N} \sum_{l > m} c_{nl} \tag{C.32} \]

Useful moments in VB updates for other model parameters:

\[ \langle \ln v_m \rangle = \psi(\gamma_{m1}) - \psi(\gamma_{m1} + \gamma_{m2}) \tag{C.33} \]

\[ \langle \ln(1 - v_m) \rangle = \psi(\gamma_{m2}) - \psi(\gamma_{m1} + \gamma_{m2}) \tag{C.34} \]

C.6  Variational Posterior on α

It was derived in Section that the NFE is maximized by a variational posterior, $q(\alpha)$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\alpha) \propto \langle \log p(\alpha \mid -) \rangle \tag{C.35} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\alpha \mid -) = \log p(\mathbf{v} \mid \alpha) + \log p(\alpha) - K, \tag{C.36} \]

where K denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of α, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\alpha \mid -) &= \sum_{m=1}^{T-1} \left[ \log \alpha + (\alpha - 1) \log(1 - v_m) \right] + (\tau_{10} - 1) \log \alpha - \tau_{20} \alpha - K \\
&= \alpha \left[ -\tau_{20} + \sum_{m=1}^{T-1} \log(1 - v_m) \right] + (\tau_{10} + T - 2) \log \alpha - K
\end{aligned}
\tag{C.37}
\]

Therefore,

\[ p(\alpha \mid -) = \mathrm{Gamma}(\tau_1, \tau_2) \tag{C.38} \]

\[ \tau_1 = \tau_{10} + T - 1 \tag{C.39} \]

\[ \tau_2 = \tau_{20} - \sum_{m=1}^{T-1} \log(1 - v_m) \tag{C.40} \]

Useful moments in VB updates for other model parameters:

\[ \langle \alpha \rangle = \frac{\tau_1}{\tau_2} \tag{C.41} \]

C.7  Variational Posterior on C


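It was derived in Section that the NFE is maximized by a variational posterior, $q(\mathbf{C})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\mathbf{C}) \propto \langle \log p(\mathbf{C} \mid -) \rangle \tag{C.42} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\mathbf{C} \mid -) \propto \log p(\mathbf{X} \mid \mathbf{C}, -) + \log p(\mathbf{C}) - K, \tag{C.43} \]

where K denotes a normalizing constant. The posterior will also be multinomial with parameters (responsibilities) φ:

\[
\begin{aligned}
\log p(c_{nm} = 1 \mid -) &= -\frac{D}{2} \log 2\pi - \frac{1}{2} \log \left| \boldsymbol{\Lambda}_m^{-1} \right| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n - \boldsymbol{\mu}_m) + \log v_m + \sum_{l < m} \log(1 - v_l) - K \\
&= \frac{1}{2} \log \left| \boldsymbol{\Lambda}_m \right| - \frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n - \boldsymbol{\mu}_m) + \log v_m + \sum_{l < m} \log(1 - v_l) - K
\end{aligned}
\tag{C.44}
\]

Therefore,

\[ p(\mathbf{c}_n \mid -) \propto \mathrm{Multinomial}(\boldsymbol{\phi}_n) \tag{C.45} \]

Useful moments in VB updates for other model parameters:

\[ \langle c_{nm} \rangle = \phi_{nm} \tag{C.46} \]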
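Because (C.44) gives the responsibilities only up to the constant K, they must be normalized across components. A common way to do this without numerical underflow is to subtract the largest log value before exponentiating. The sketch below is a generic illustration; the function name `normalize_log_responsibilities` is hypothetical and is not taken from the dissertation.

```python
import math

def normalize_log_responsibilities(log_rho):
    """Turn unnormalized log-responsibilities for one observation into phi_n.

    log_rho : list of T values, each equal to the expected log p(c_nm = 1 | -) plus a shared constant
    """
    peak = max(log_rho)                               # subtract the max for stability
    unnormalized = [math.exp(value - peak) for value in log_rho]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Values this negative would underflow if exponentiated directly.
phi = normalize_log_responsibilities([-1000.0, -1000.0 + math.log(3.0)])
```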

C.8  Negative  Free  Energy 

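The NFE can be expressed as the difference between the expected log-likelihood and the Kullback-Leibler divergence (KLD) between the variational posteriors and the priors:

\[
\begin{aligned}
\mathcal{F} &= \langle \log p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \mathbf{C}) \rangle - \mathrm{KLD}\left[ q(\boldsymbol{\mu}, \boldsymbol{\Lambda}, \mathbf{C}, \mathbf{v}, \alpha) \,\|\, p(\boldsymbol{\mu}, \boldsymbol{\Lambda}, \mathbf{C}, \mathbf{v}, \alpha) \right] \\
&= \langle \log p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}, \mathbf{C}) \rangle - \sum_{n=1}^{N} \mathrm{KLD}\left[ q(\mathbf{c}_n) \,\|\, p(\mathbf{c}_n \mid \mathbf{v}) \right] \\
&\quad - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m)\, q(\boldsymbol{\Lambda}_m) \,\|\, p(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m)\, p(\boldsymbol{\Lambda}_m) \right] \\
&\quad - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(v_m) \,\|\, p(v_m) \right] - \mathrm{KLD}\left[ q(\alpha) \,\|\, p(\alpha) \right] \\
&= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} \langle c_{nm} \rangle \left[ D \log 2\pi - \langle \log \left| \boldsymbol{\Lambda}_m \right| \rangle + \langle (\mathbf{x}_n - \boldsymbol{\mu}_m)^T \boldsymbol{\Lambda}_m (\mathbf{x}_n - \boldsymbol{\mu}_m) \rangle \right] \\
&\quad - \sum_{n=1}^{N} \mathrm{KLD}\left[ q(\mathbf{c}_n) \,\|\, p(\mathbf{c}_n \mid \mathbf{v}) \right] - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m)\, q(\boldsymbol{\Lambda}_m) \,\|\, p(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m)\, p(\boldsymbol{\Lambda}_m) \right] \\
&\quad - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(v_m) \,\|\, p(v_m) \right] - \mathrm{KLD}\left[ q(\alpha) \,\|\, p(\alpha) \right],
\end{aligned}
\tag{C.47}
\]

where $\mathrm{KLD}[q(\mathbf{c}_n) \| p(\mathbf{c}_n \mid \mathbf{v})]$ is a KLD between two multinomial distributions, $\mathrm{KLD}[q(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m) q(\boldsymbol{\Lambda}_m) \| p(\boldsymbol{\mu}_m \mid \boldsymbol{\Lambda}_m) p(\boldsymbol{\Lambda}_m)]$ is a KLD between two Normal-Wishart distributions, $\mathrm{KLD}[q(v_m) \| p(v_m)]$ is a KLD between two Beta distributions, and $\mathrm{KLD}[q(\alpha) \| p(\alpha)]$ is a KLD between two Gamma distributions.

Appendix D

Dirichlet Process Mixture of Factor Analyzers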


The Dirichlet process mixture of factor analyzers (DPMFA) model is used in this work for generative context learning. Like the Dirichlet process Gaussian mixture model (DPGMM), it is an unsupervised clustering technique that facilitates learning the number of clusters. Additionally, the use of the factor analysis model allows a local, latent, lower-dimensional structure to be learned for each mixture component. This is accomplished by selecting features from a loading matrix that is shared by all mixture components. This appendix presents the DPMFA, adapted from Ghahramani and Beal [68] and Wang et al. [94], including derivations for all variational Bayesian (VB) update equations and the negative free energy (NFE).

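D.1  Model and Variable Definitions

\[ (\mathbf{x}_n \mid c_{nm} = 1) \sim \mathcal{N}_D\left( \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m,\ \mathrm{diag}(\boldsymbol{\psi}_m)^{-1} \right) \tag{D.1} \]

n = 1, 2, ..., N is the data index

m = 1, 2, ..., T is the mixture component index (T is arbitrarily large)

d = 1, 2, ..., D is the data dimension index

k = 1, 2, ..., K is the factor index

$\mathbf{x}_n$ is the D × 1 data vector

$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, ..., \mathbf{a}_K]$ is the D × K factor matrix

$\mathbf{z}_m = [z_{m1}, z_{m2}, ..., z_{mK}]^T$ is a K × 1 binary-coded selection vector

$\mathbf{s}_n = [s_{1n}, s_{2n}, ..., s_{Kn}]^T$ is a K × 1 score vector

$\boldsymbol{\mu}_m = [\mu_{m1}, \mu_{m2}, ..., \mu_{mD}]^T$ is the D × 1 component mean vector

$\boldsymbol{\psi}_m = [\psi_{m1}, \psi_{m2}, ..., \psi_{mD}]^T$ is the D × 1 vector of component precisions

$c_{nm}$ is a binary-coded latent variable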

D.2  Priors 


\[ p(A_{dk} \mid \gamma_{dk}) \sim \mathcal{N}(A_{dk} \mid 0, \gamma_{dk}^{-1}) \tag{D.2} \]

\[ p(\mathbf{s}_n \mid \delta) \sim \mathcal{N}_K(\mathbf{s}_n \mid \mathbf{0}, \delta^{-1} \mathbf{I}) \tag{D.3} \]

\[ p(z_{mk} \mid \pi_{mk}) \sim \mathrm{Bernoulli}(\pi_{mk}) \tag{D.4} \]

\[ p(\pi_{mk}) \sim \mathrm{Beta}\left( \pi_{mk} \mid a_0 / K,\ b_0 (K - 1) / K \right) \tag{D.5} \]

\[ p(\gamma_{dk}) \sim \mathrm{Gamma}(\gamma_{dk} \mid e_0, f_0) \tag{D.6} \]

p(μ_m | ψ_m) ~    (D.7)

p(ψ_md) ~    (D.8)

p(c_n) ~    (D.9)

\[ \pi_m(\mathbf{v}) = v_m \prod_{l < m} (1 - v_l) \tag{D.10} \]

\[ p(v_m \mid \alpha) \sim \mathrm{Beta}(v_m \mid 1, \alpha) \tag{D.11} \]

\[ p(\alpha) \sim \mathrm{Gamma}(\alpha \mid \tau_{10}, \tau_{20}) \tag{D.12} \]

p(S) 

rv./ 

Gamma  (55io,  £20) 

(D.13) 
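D.3  Model Likelihood

The joint likelihood of data given all model parameters is given by

\[ p(\mathbf{X} \mid \mathbf{A}, \mathbf{S}, \mathbf{Z}, \boldsymbol{\Psi}) = \prod_{n=1}^{N} \prod_{m=1}^{T} \mathcal{N}_D\left( \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m,\ \mathrm{diag}(\boldsymbol{\psi}_m)^{-1} \right)^{c_{nm}} \tag{D.14} \]

The log-likelihood is used for analysis:

\[
\begin{aligned}
\log p(\mathbf{X} \mid \mathbf{A}, \mathbf{S}, \mathbf{Z}, \boldsymbol{\Psi}) &= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \Big( D \log 2\pi + \log \left| \mathrm{diag}(\boldsymbol{\psi}_m)^{-1} \right| \\
&\quad\quad + \left[ \mathbf{x}_n - (\mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m) \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{x}_n - (\mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m) \right] \Big) \\
&= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \Big( \mathbf{x}_n^T \mathrm{diag}(\boldsymbol{\psi}_m)\, \mathbf{x}_n - 2 \mathbf{x}_n^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right] \\
&\quad\quad + \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right] \\
&\quad\quad + \log \left| \mathrm{diag}(\boldsymbol{\psi}_m)^{-1} \right| + D \log 2\pi \Big)
\end{aligned}
\tag{D.15}
\]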
Converting some of the terms to sums yields

where

\[ \log \left| \mathrm{diag}(\boldsymbol{\psi}_m)^{-1} \right| = -\sum_{d=1}^{D} \log \psi_{md} \tag{D.18} \]

\[ \mathbf{x}_n^T \mathrm{diag}(\boldsymbol{\psi}_m)\, \mathbf{x}_n = \sum_{d=1}^{D} x_{nd}^2 \psi_{md} \tag{D.19} \]

<2=1 
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Rephrasing  the  log-likelihood  using  these  new  quantities  yields: 


log  p  (X|A,  S,  Z,  \I>) 
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d=l  fc=l  j=l 


(D.20) 


where  xj[m  =  xnd  -  E Adizmlsln. 

D.4  Variational Posterior on A

It was derived in Section E.10 that the NFE is maximized by a variational posterior, $q(\mathbf{A})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\mathbf{A}) \propto \langle \log p(\mathbf{A} \mid \mathbf{X}, -) \rangle \tag{D.21} \]

The true log-posterior may be calculated from Bayes' theorem

\[ \log p(\mathbf{A} \mid \mathbf{X}, -) = \log p(\mathbf{X} \mid \mathbf{A}, -) + \log p(\mathbf{A}) - E, \tag{D.22} \]

where E denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of A, then taking the variational expectation $\langle \cdot \rangle$:


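\[
\log p(\mathbf{A} \mid \mathbf{X}, -) = -\frac{1}{2} \sum_{d=1}^{D} \sum_{k=1}^{K} \left[ -2 A_{dk} \sum_{m=1}^{T} \psi_{md} z_{mk} \sum_{n=1}^{N} c_{nm} s_{kn} \left( x_{ndm}^{-k} - \mu_{md} \right) + A_{dk}^2 \left( \gamma_{dk} + \sum_{m=1}^{T} \sum_{n=1}^{N} c_{nm} \psi_{md} z_{mk} s_{kn}^2 \right) \right] - E
\tag{D.23}
\]

Completing the square reveals that $A_{dk}$ is Gaussian:

\[ \log p(\mathbf{A} \mid \mathbf{X}, -) = \sum_{d=1}^{D} \sum_{k=1}^{K} \log \mathcal{N}(u_{dk}, \sigma_{dk}) \tag{D.24} \]

where

\[ \sigma_{dk} = \left[ \gamma_{dk} + \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \psi_{md} z_{mk} s_{kn}^2 \right]^{-1} \tag{D.25} \]

\[ u_{dk} = \sigma_{dk} \left[ \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \psi_{md} z_{mk} s_{kn} \left( x_{ndm}^{-k} - \mu_{md} \right) \right] \tag{D.26} \]

Useful moments in VB updates for other model parameters:

\[ \langle A_{dk} \rangle = u_{dk} \tag{D.27} \]

\[ \langle A_{dk}^2 \rangle = u_{dk}^2 + \sigma_{dk} \tag{D.28} \]

\[ \langle A_{dk} A_{dl} \rangle = \langle A_{dk} \rangle \langle A_{dl} \rangle \tag{D.29} \]

D.5  Variational Posterior on S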


It  was  derived  in  Section  E.10  that  the  NFE  is  maximized  by  a  variational  posterior, 
q( S),  that  is  proportional  to  the  variational  expectation  of  the  true  log-posterior 
with  respect  to  all  other  model  parameters: 


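\[ q(\mathbf{S}) \propto \langle \log p(\mathbf{S} \mid \mathbf{X}, -) \rangle \tag{D.30} \]

The true log-posterior may be calculated from Bayes' theorem

\[ \log p(\mathbf{s}_n \mid \mathbf{x}_n, -) = \log p(\mathbf{x}_n \mid \mathbf{s}_n, -) + \log p(\mathbf{s}_n) - E, \tag{D.31} \]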

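where E denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of s, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\mathbf{s}_n \mid \mathbf{x}_n, -) &= -\frac{1}{2} \sum_{m=1}^{T} c_{nm} \Big( -2 \mathbf{x}_n^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right] \\
&\quad\quad + \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \boldsymbol{\mu}_m \right] \Big) - \frac{1}{2} \mathbf{s}_n^T \delta \mathbf{I}\, \mathbf{s}_n - E \\
&= -\frac{1}{2} \Big( -2 \sum_{m=1}^{T} c_{nm} \mathbf{x}_n^T \mathrm{diag}(\boldsymbol{\psi}_m)\, \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n + \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n \right] \\
&\quad\quad + 2 \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m)\, \mathbf{s}_n \right]^T \mathrm{diag}(\boldsymbol{\psi}_m)\, \boldsymbol{\mu}_m + \mathbf{s}_n^T \delta \mathbf{I}\, \mathbf{s}_n \Big) - E \\
&= -\frac{1}{2} \Big( -2 \mathbf{s}_n^T \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{x}_n - \boldsymbol{\mu}_m \right] \\
&\quad\quad + \mathbf{s}_n^T \Big\{ \delta \mathbf{I} + \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right] \Big\} \mathbf{s}_n \Big) - E
\end{aligned}
\tag{D.32}
\]

Completing the square reveals that $\mathbf{s}_n$ is Gaussian:

\[ \log p(\mathbf{s}_n \mid \mathbf{x}_n, -) = \log \mathcal{N}_K(\boldsymbol{\xi}_n, \boldsymbol{\Lambda}_n) \tag{D.33} \]

where

\[ \boldsymbol{\Lambda}_n = \left( \delta \mathbf{I} + \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right] \right)^{-1} \tag{D.34} \]

\[ \boldsymbol{\xi}_n = \boldsymbol{\Lambda}_n \left( \sum_{m=1}^{T} c_{nm} \left[ \mathbf{A}\, \mathrm{diag}(\mathbf{z}_m) \right]^T \mathrm{diag}(\boldsymbol{\psi}_m) \left[ \mathbf{x}_n - \boldsymbol{\mu}_m \right] \right) \tag{D.35} \]

Useful moments in VB updates for other model parameters:

\[ \langle \mathbf{s}_n \rangle = \boldsymbol{\xi}_n \tag{D.36} \]

\[ \langle \mathbf{s}_n \mathbf{s}_n^T \rangle = \boldsymbol{\xi}_n \boldsymbol{\xi}_n^T + \boldsymbol{\Lambda}_n \tag{D.37} \]

D.6  Variational Posterior on z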

It was derived in Section E.10 that the NFE is maximized by a variational posterior, $q(\mathbf{Z})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\mathbf{Z}) \propto \langle \log p(\mathbf{Z} \mid \mathbf{X}, -) \rangle \tag{D.38} \]

The true log-posterior may be calculated from Bayes' theorem

\[ \log p(\mathbf{Z} \mid \mathbf{X}, -) = \log p(\mathbf{X} \mid \mathbf{Z}, -) + \log p(\mathbf{Z}) - E, \tag{D.39} \]

where E denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of z, then taking the variational expectation $\langle \cdot \rangle$:

\[ \log p(z_{mk} = 1 \mid -) \propto \log p(\mathbf{X} \mid z_{mk} = 1, -) + \log(\pi_{mk}) - E \tag{D.40} \]

\[ \log p(z_{mk} = 0 \mid -) \propto \log p(\mathbf{X} \mid z_{mk} = 0, -) + \log(1 - \pi_{mk}) - E \tag{D.41} \]

where $\log p(\mathbf{X} \mid z_{mk} = 1, -)$ and $\log p(\mathbf{X} \mid z_{mk} = 0, -)$ are given by (D.20) with $z_{mk}$ set equal to 1 or 0:


log  p(X.\zmk  =  1,  -) 


N 


-  X' 


n=  1 


D 


dPmd  +  p md )  “h  /  ^  'll’md^dkS 


<2=1 


1  kn 


D 

~  ^  ^ mdAdkSkn  ( Xndm 

d=  1 


1 


V  D 

2  Cnm 
n=l  <2=1 


'lPmdAdkS 


kn 


N  D 

+XX  Cnm^md-^-dk^kn  (%ndm. 

n=  1  d=  1 


-  E 


log  p(X|zmfe  =  0,  -) 


1 

2 


—  E 


=E 


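Therefore, $p(z_{mk} \mid -) \sim \mathrm{Bernoulli}(\rho_{mk})$, where

\[ \rho_{mk} = \frac{\exp\left( \zeta_{mk}^{(1)} \right)}{\exp\left( \zeta_{mk}^{(1)} \right) + \exp\left( \zeta_{mk}^{(2)} \right)} \tag{D.43} \]

\[ \zeta_{mk}^{(1)} = \log(\pi_{mk}) - \frac{1}{2} \sum_{n=1}^{N} \sum_{d=1}^{D} c_{nm} \psi_{md} A_{dk}^2 s_{kn}^2 + \sum_{n=1}^{N} \sum_{d=1}^{D} c_{nm} \psi_{md} A_{dk} s_{kn} \left( x_{ndm}^{-k} - \mu_{md} \right) \tag{D.44} \]

\[ \zeta_{mk}^{(2)} = \log(1 - \pi_{mk}) \tag{D.45} \]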
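The ratio of exponentials that defines the selection probability is a logistic function of the difference between the two log-weights, and it can overflow if evaluated naively. The sketch below shows a numerically stable evaluation; the function name `selection_probability` and the argument names `zeta1` and `zeta2` (standing for the two log-weights of the "on" and "off" states) are assumptions of this example.

```python
import math

def selection_probability(zeta1, zeta2):
    """Return exp(zeta1) / (exp(zeta1) + exp(zeta2)) without overflow.

    zeta1 : log-weight for z_mk = 1
    zeta2 : log-weight for z_mk = 0
    """
    diff = zeta1 - zeta2
    if diff >= 0:
        # exp(-diff) <= 1, so this branch cannot overflow
        return 1.0 / (1.0 + math.exp(-diff))
    # exp(diff) < 1 here, so this branch cannot overflow either
    e = math.exp(diff)
    return e / (1.0 + e)

rho_even = selection_probability(0.0, 0.0)
rho_on = selection_probability(1000.0, 0.0)
rho_off = selection_probability(-1000.0, 0.0)
```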


Useful  moments  in  VB  updates  for  other  model  parameters: 


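\[ \langle z_{mk} \rangle = \rho_{mk} \tag{D.46} \]

\[ \langle z_{mk}^2 \rangle = \rho_{mk}, \quad \text{since } z_{mk} \text{ is binary.} \tag{D.47} \]

\[ \langle z_{mk} z_{ml} \rangle = \langle z_{mk} \rangle \langle z_{ml} \rangle \tag{D.48} \]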

D.7  Variational Posterior on μ

It was derived in Section E.10 that the NFE is maximized by a variational posterior, $q(\boldsymbol{\mu})$, that is proportional to the variational expectation of the true log-posterior with respect to all other model parameters:

\[ q(\boldsymbol{\mu}) \propto \langle \log p(\boldsymbol{\mu} \mid \mathbf{X}, -) \rangle \tag{D.49} \]

The true log-posterior may be calculated from Bayes' theorem

\[ \log p(\boldsymbol{\mu}_m \mid \mathbf{X}, -) = \log p(\mathbf{X} \mid \boldsymbol{\mu}_m, -) + \log p(\boldsymbol{\mu}_m \mid \boldsymbol{\psi}_m) - E, \tag{D.50} \]

where E denotes a normalizing constant. The variational posterior can be calculated by solving the true log-posterior as a function of $\boldsymbol{\mu}_m$, then taking the variational expectation $\langle \cdot \rangle$:
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log  p(fim |X,  — ) 


=  -\  Y  c™n  ([x„  -  Acliag  (s„)  -  n]T  diag  (^m)  [xn  -  Adiag  (sn)  -  pm] 

n= 1 

+  log  Idiag^/CT^T  +  ^  ([pm  -  p0]T  «0diag  (>J  [pm  -  P0] 

+  lo^\v^}4isz^  -E 

=  -  2  Y Cnm  (^mdiag  C*/\J  Pm  -  2Pmdiag  Om)  [x„  -  Adiag  (sn)] 

n= 1 

+  ^j-AAiag]^  -  ^  (Pm^odiag  (V\J  pm 

-  2^M0diag  OJ  Po  +  ^oWiagf^Jp^)  -  E 
1  w 

=  -  2  _2Pm  («odiag  (VhJ  Po  +  cnmdiag  (V\J  [xn  -  Adiag  (sn)]) 

n= 1 
N 

+  Y  Onrn  Pm  diag  (V\n)  Pm  +  Pm^odiag  (^m)  Pm  -  E 

n=  1 

(D-51) 

Therefore, 
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Useful  moments  in  VB  updates  for  other  model  parameters: 


{P>m)  =  Pm  (D.55) 

(PmPm)  =  Pm  Pm  +  Um  (D-56) 

D.8 Variational Posterior on ψ

It was derived in Section E.10 that the NFE is maximized by a variational posterior,
$q(\psi)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\psi) \propto \langle \log p(\psi \mid X, -) \rangle \tag{D.57} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\psi \mid X, -) = \log p(X \mid \psi, -) + \log p(\psi) - E, \tag{D.58} \]

where $E$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $\psi$, then taking the variational
expectation $\langle \cdot \rangle$:


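\[
\begin{aligned}
\log p(\psi \mid X, -)
={}& -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \Bigg[ \sum_{d=1}^{D} \psi_{md} \left( x_{nd}^2 - 2 x_{nd} \mu_{md} + \mu_{md}^2 \right) + \sum_{d=1}^{D} \sum_{k=1}^{K} \psi_{md} A_{dk}^2 z_{mk} s_{kn}^2 \\
& \qquad - 2 \sum_{d=1}^{D} \sum_{k=1}^{K} \psi_{md} A_{dk} z_{mk} s_{kn} \left[ x_{ndm} - \mu_{md} \right] - \sum_{d=1}^{D} \log \psi_{md} \Bigg] \\
& + \sum_{m=1}^{T} \sum_{d=1}^{D} \left[ g_0 \log(h_0) - \log \Gamma(g_0) + (g_0 - 1) \log \psi_{md} - h_0 \psi_{md} \right] - E \\
={}& -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \sum_{d=1}^{D} \psi_{md} \Bigg( x_{nd}^2 - 2 x_{nd} \mu_{md} + \mu_{md}^2 + \sum_{k=1}^{K} A_{dk}^2 z_{mk} s_{kn}^2 \\
& \qquad - 2 \sum_{k=1}^{K} A_{dk} z_{mk} s_{kn} \left[ x_{ndm} - \mu_{md} \right] \Bigg) + \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \sum_{d=1}^{D} \log \psi_{md} \\
& + \sum_{m=1}^{T} \sum_{d=1}^{D} (g_0 - 1) \log \psi_{md} - \sum_{m=1}^{T} \sum_{d=1}^{D} h_0 \psi_{md} - E
\end{aligned}
\tag{D.59}
\]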
Exponentiating yields:

\[
\begin{aligned}
p(\psi \mid X, -) \propto \prod_{m=1}^{T} \prod_{d=1}^{D} & \; \psi_{md}^{\, g_0 - 1 + \frac{1}{2} \sum_{n=1}^{N} c_{nm}} \exp \Bigg\{ -\psi_{md} \Bigg[ h_0 + \frac{1}{2} \sum_{n=1}^{N} c_{nm} \Bigg( x_{nd}^2 - 2 x_{nd} \mu_{md} + \mu_{md}^2 \\
& + \sum_{k=1}^{K} A_{dk}^2 z_{mk} s_{kn}^2 - 2 \sum_{k=1}^{K} A_{dk} z_{mk} s_{kn} \left( x_{ndm} - \mu_{md} \right) \Bigg) \Bigg] \Bigg\}
\end{aligned}
\tag{D.60}
\]


Therefore,

\[ p(\psi \mid X, -) = \prod_{m=1}^{T} \prod_{d=1}^{D} \mathrm{Gamma}(g_{md}, h_{md}) \tag{D.61} \]

where

\[ g_{md} = \frac{\sum_{n=1}^{N} c_{nm}}{2} + g_0 \tag{D.62} \]

\[ h_{md} = h_0 + \frac{1}{2} \sum_{n=1}^{N} c_{nm} \Bigg[ x_{nd}^2 - 2 x_{nd} \mu_{md} + \mu_{md}^2 + \sum_{k=1}^{K} A_{dk}^2 z_{mk} s_{kn}^2 - 2 \sum_{k=1}^{K} A_{dk} z_{mk} s_{kn} \left( x_{ndm} - \mu_{md} \right) \Bigg] \tag{D.63} \]

Useful moments in VB updates for other model parameters:

\[ \langle \psi_{md} \rangle = \frac{g_{md}}{h_{md}} \tag{D.64} \]

\[ \langle \log \psi_{md} \rangle = \mathrm{Digamma}(g_{md}) - \log h_{md} \tag{D.65} \]

D.9 Variational Posterior on π

It was derived in Section E.10 that the NFE is maximized by a variational posterior,
$q(\pi)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\pi_{mk}) \propto \langle \log p(\pi_{mk} \mid -) \rangle \tag{D.66} \]

The true posterior may be calculated from Bayes' theorem:

\[
\begin{aligned}
p(\pi_{mk} \mid -) &\propto p(z_{mk} \mid \pi_{mk}) \, p(\pi_{mk}) \\
&\propto \pi_{mk}^{z_{mk}} (1 - \pi_{mk})^{1 - z_{mk}} \, \pi_{mk}^{\frac{a_0}{K} - 1} (1 - \pi_{mk})^{\frac{b_0 (K-1)}{K} - 1} \\
&\propto \pi_{mk}^{z_{mk} + \frac{a_0}{K} - 1} (1 - \pi_{mk})^{1 - z_{mk} + \frac{b_0 (K-1)}{K} - 1}
\end{aligned}
\tag{D.67}
\]

Therefore,

\[ p(\pi_{mk} \mid -) = \mathrm{Beta}(a_{mk}, b_{mk}), \tag{D.68} \]

where

\[ a_{mk} = z_{mk} + \frac{a_0}{K} \tag{D.69} \]

\[ b_{mk} = 1 - z_{mk} + \frac{b_0 (K - 1)}{K} \tag{D.70} \]

Useful moments in VB updates for other model parameters:

\[ \langle \pi_{mk} \rangle = \frac{a_{mk}}{a_{mk} + b_{mk}} \tag{D.71} \]

\[ \langle \log \pi_{mk} \rangle = \mathrm{Digamma}(a_{mk}) - \mathrm{Digamma}(a_{mk} + b_{mk}) \tag{D.72} \]

\[ \langle \log(1 - \pi_{mk}) \rangle = \mathrm{Digamma}(b_{mk}) - \mathrm{Digamma}(a_{mk} + b_{mk}) \tag{D.73} \]
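The Beta update and its moments can be sketched in a few lines. This is an illustrative sketch only: the helper names `digamma` and `beta_posterior_moments` are hypothetical, the digamma function is approximated with a standard recurrence plus asymptotic series rather than taken from a library, and the argument `z_mean` is assumed to stand for the current expectation of $z_{mk}$.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    # Recurrence digamma(x) = digamma(x + 1) - 1/x shifts x upward.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    # Asymptotic series: log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def beta_posterior_moments(z_mean, a0, b0, K):
    """Beta parameters and the three expectations for one (m, k) pair."""
    a = z_mean + a0 / K
    b = 1.0 - z_mean + b0 * (K - 1) / K
    mean = a / (a + b)
    elog_pi = digamma(a) - digamma(a + b)
    elog_one_minus_pi = digamma(b) - digamma(a + b)
    return a, b, mean, elog_pi, elog_one_minus_pi

# Illustrative hyperparameters only.
a, b, mean, elog_pi, elog_1m = beta_posterior_moments(1.0, 1.0, 1.0, 4)
```

By Jensen's inequality the expected logarithm is always below the logarithm of the mean, which gives a quick sanity check on any implementation.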

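D.10 Variational Posterior on γ

It was derived in Section E.10 that the NFE is maximized by a variational posterior,
$q(\gamma)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\gamma_{dk}) \propto \langle \log p(\gamma_{dk} \mid -) \rangle \tag{D.74} \]

The true posterior may be calculated from Bayes' theorem:

\[
\begin{aligned}
p(\gamma_{dk} \mid -) &\propto p(A_{dk} \mid \gamma_{dk}) \, p(\gamma_{dk}) \\
&\propto \gamma_{dk}^{\frac{1}{2}} \exp\left( -\frac{\gamma_{dk} A_{dk}^2}{2} \right) \gamma_{dk}^{e_0 - 1} \exp\left( -f_0 \gamma_{dk} \right) \\
&\propto \gamma_{dk}^{e_0 + \frac{1}{2} - 1} \exp\left( -\gamma_{dk} \left[ f_0 + \frac{A_{dk}^2}{2} \right] \right)
\end{aligned}
\tag{D.75}
\]

Therefore,

\[ p(\gamma_{dk} \mid -) = \mathrm{Gamma}(e_{dk}, f_{dk}), \tag{D.76} \]

where

\[ e_{dk} = e_0 + \frac{1}{2} \tag{D.77} \]

\[ f_{dk} = f_0 + \frac{A_{dk}^2}{2} \tag{D.78} \]

Useful moments in VB updates for other model parameters:

\[ \langle \gamma_{dk} \rangle = \frac{e_{dk}}{f_{dk}} \tag{D.79} \]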
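The conjugate Gamma update above amounts to two additions and a division. A minimal sketch follows; the function name `gamma_precision_update` is hypothetical, and the argument `a_sq_mean` is assumed to be the current expectation of $A_{dk}^2$.

```python
def gamma_precision_update(a_sq_mean, e0, f0):
    """Return (e_dk, f_dk, <gamma_dk>) for one loading A_dk."""
    e_dk = e0 + 0.5              # shape grows by 1/2 per observed loading
    f_dk = f0 + 0.5 * a_sq_mean  # rate grows with the squared loading
    return e_dk, f_dk, e_dk / f_dk

# Illustrative values only: <A_dk^2> = 4 with e0 = f0 = 1.
e_dk, f_dk, gamma_mean = gamma_precision_update(4.0, 1.0, 1.0)
```

A large squared loading inflates the rate and therefore lowers the expected precision, so loadings that are far from zero are penalized less.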
D.11 Variational Posterior on C

It was derived in Section E.10 that the NFE is maximized by a variational posterior,
$q(C)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(C) \propto \langle \log p(C \mid -) \rangle \tag{D.80} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(c_{nm} = 1 \mid -) = \log p(x_n \mid c_{nm} = 1, -) + \log p(c_{nm} = 1) - E, \tag{D.81} \]

where $E$ denotes a normalizing constant. The posterior will also be multinomial with
parameters (responsibilities)

\[
\begin{aligned}
\log p(c_{nm} = 1 \mid -)
={}& -\frac{1}{2} \Bigg[ \sum_{d=1}^{D} \psi_{md} \left( x_{nd}^2 - 2 x_{nd} \mu_{md} + \mu_{md}^2 \right) + \sum_{d=1}^{D} \sum_{k=1}^{K} \psi_{md} A_{dk}^2 z_{mk} s_{kn}^2 \\
& \qquad - 2 \sum_{d=1}^{D} \sum_{k=1}^{K} \psi_{md} A_{dk} z_{mk} s_{kn} \left( x_{ndm} - \mu_{md} \right) - \sum_{d=1}^{D} \log \psi_{md} \Bigg] \\
& + \log v_m + \sum_{l < m} \log(1 - v_l) - E
\end{aligned}
\tag{D.82}
\]

Useful moments in VB updates for other model parameters:

\[ \langle c_{nm} \rangle = \phi_{nm} \tag{D.83} \]


D.12 Variational Posterior on v

It was derived previously that the NFE is maximized by a variational posterior,
$q(v)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(v) \propto \langle \log p(v_m \mid -) \rangle \tag{D.84} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(v_m \mid -) = \log p(C \mid v_m) + \log p(v_m) - E, \tag{D.85} \]

where $E$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $v$, then taking the variational
expectation $\langle \cdot \rangle$:

\[ \log p(v_m \mid -) = \sum_{n=1}^{N} c_{nm} \log v_m + \sum_{n=1}^{N} \sum_{l > m} c_{nl} \log(1 - v_m) + (\alpha - 1) \log(1 - v_m) - E \tag{D.86} \]

Therefore,

\[ p(v_m \mid -) = \mathrm{Beta}(\nu_{m1}, \nu_{m2}) \tag{D.87} \]

where

\[ \nu_{m1} = 1 + \sum_{n=1}^{N} c_{nm} \tag{D.88} \]

\[ \nu_{m2} = \alpha + \sum_{n=1}^{N} \sum_{s > m} c_{ns} \tag{D.89} \]

Useful moments in VB updates for other model parameters:

\[ \langle \ln v_m \rangle = \mathrm{Digamma}(\nu_{m1}) - \mathrm{Digamma}(\nu_{m1} + \nu_{m2}) \tag{D.90} \]

\[ \langle \ln(1 - v_m) \rangle = \mathrm{Digamma}(\nu_{m2}) - \mathrm{Digamma}(\nu_{m1} + \nu_{m2}) \tag{D.91} \]
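The two Beta parameters for each stick are simple counts over the cluster assignments. The sketch below is illustrative rather than the author's code: `stick_beta_updates` is a hypothetical name, `resp[n][m]` is assumed to hold the expected assignment of observation n to component m, and `alpha` stands for the current expectation of the concentration parameter.

```python
def stick_beta_updates(resp, alpha):
    """Return a list of (nu_m1, nu_m2) pairs, one per mixture component."""
    num_components = len(resp[0])
    # Expected number of observations assigned to each component.
    counts = [sum(row[m] for row in resp) for m in range(num_components)]
    params = []
    for m in range(num_components):
        # Mass assigned to components with a larger index than m.
        tail = sum(counts[m + 1:])
        params.append((1.0 + counts[m], alpha + tail))
    return params

# Illustrative hard assignments: one point in component 0, two in component 1.
resp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
params = stick_beta_updates(resp, 2.0)
```

Components with no assigned mass keep the prior values, which is how unused sticks stay close to Beta(1, α).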


D.13 Variational Posterior on α

It was derived previously that the NFE is maximized by a variational posterior,
$q(\alpha)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\alpha) \propto \langle \log p(\alpha \mid -) \rangle \tag{D.92} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\alpha \mid -) = \log p(v \mid \alpha) + \log p(\alpha) - E, \tag{D.93} \]

where $E$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $\alpha$, then taking the variational
expectation $\langle \cdot \rangle$:

(D.94)

Therefore,

\[ p(\alpha \mid -) = \mathrm{Gamma}(\tau_1, \tau_2) \tag{D.95} \]

\[ \tau_1 = \tau_{10} + T - 1 \tag{D.96} \]

\[ \tau_2 = \tau_{20} - \sum_{m=1}^{T-1} \log(1 - v_m) \tag{D.97} \]

Useful moments in VB updates for other model parameters:

\[ \langle \alpha \rangle = \frac{\tau_1}{\tau_2} \tag{D.98} \]
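D.14 Variational Posterior on δ

It was derived previously that the NFE is maximized by a variational posterior,
$q(\delta)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\delta) \propto \langle \log p(\delta \mid -) \rangle \tag{D.99} \]

The true posterior may be calculated from Bayes' theorem:

\[
\begin{aligned}
p(\delta \mid -) &\propto p(S \mid \delta) \, p(\delta) \\
&\propto \prod_{n=1}^{N} \prod_{k=1}^{K} \delta^{\frac{1}{2}} \exp\left( -\frac{\delta s_{kn}^2}{2} \right) \delta^{\delta_{10} - 1} \exp\left( -\delta_{20} \delta \right) \\
&\propto \delta^{\delta_{10} + KN/2 - 1} \exp\left( -\delta \left( \delta_{20} + \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} s_{kn}^2 \right) \right).
\end{aligned}
\tag{D.100}
\]

Therefore,

\[ q(\delta) = \mathrm{Gamma}(\delta_1, \delta_2), \tag{D.101} \]

where

\[ \delta_1 = \delta_{10} + \frac{KN}{2} \tag{D.102} \]

\[ \delta_2 = \delta_{20} + \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} s_{kn}^2 \tag{D.103} \]

Useful moments in VB updates for other model parameters:

\[ \langle \delta \rangle = \frac{\delta_1}{\delta_2} \tag{D.104} \]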

D.15  Negative  Free  Energy 

The NFE can be expressed as the difference between the expected log-likelihood and
the Kullback-Leibler divergence (KLD) between the variational posteriors and the
priors:


(D.105) 


where 

(logp(X|A,S,Z,^)) 


N  T 


^2^2{Cnm) 


n=  1  m=  1 
D  K 


D 


^  (^nd  2 Xnd({lmd)  +  (hmd)) 


d=l 


X)  J2^md^Adk)(Zmk)(sln) 


d=  1  A;=l 


—  2  ( AdkZmkSknXndm ) 

d=  1  fc=l 

D  K 

2  ^  ^  ^2{lpmd)  (A dk)  (Zmk)  (Skn)  {l-lmd) 
d=  1  fc=l 

P 

+  (logJJ^)  +Dlog2vr  , 
j=i 
(D.106)


and $\mathrm{KLD}[q(A_{dk}) \| p(A_{dk})]$ is a KLD between two Gaussian distributions,
$\mathrm{KLD}[q(s_n) \| p(s_n)]$ is between two K-dimensional Gaussian distributions,
$\mathrm{KLD}[q(z_{mk}) \| p(z_{mk})]$ is between two Bernoulli distributions,
$\mathrm{KLD}[q(\mu_m) \| p(\mu_m)]$ is between two D-dimensional Gaussian distributions,
$\mathrm{KLD}[q(\psi_{md}) \| p(\psi_{md})]$ is between two Gamma distributions,
$\mathrm{KLD}[q(\pi_{mk}) \| p(\pi_{mk})]$ is between two Beta distributions,
$\mathrm{KLD}[q(\gamma_{dk}) \| p(\gamma_{dk})]$ is between two Gamma distributions,
$\mathrm{KLD}[q(c_n) \| p(c_n)]$ is between two multinomial distributions,
$\mathrm{KLD}[q(v_m) \| p(v_m)]$ is between two Beta distributions,
$\mathrm{KLD}[q(\alpha) \| p(\alpha)]$ is between two Gamma distributions, and
$\mathrm{KLD}[q(\delta) \| p(\delta)]$ is between two Gamma distributions.
Appendix E


Discriminative DPGMM-RVM


The DPGMM-RVM hybrid model is used for discriminative context learning in Chapter 5.
The model is constructed based on the mixture-of-RVMs presented in Appendix B.3,
where the latent mixing variables are governed by a DPGMM, which was described in
Appendix C. Therefore, the derivations of the update equations and NFE for the
DPGMM-RVM are very similar to those for the individual RVM and DPGMM models.
Learning the DPGMM-RVM seeks to jointly cluster the contextual features ($X^{(C)}$)
and classify the target features ($X^{(T)}$) according to the labels, $t$.

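E.1 Generative Model and Variable Definitions

\[ y_{nm} = w_m^T x_n^{(T)} \tag{E.1} \]

\[ \left( t_n \mid c_{nm} = 1 \right) \sim \sigma(y_{nm})^{t_n} \left[ 1 - \sigma(y_{nm}) \right]^{1 - t_n} \tag{E.2} \]

\[ \left( x_n^{(C)} \mid c_{nm} = 1 \right) \sim \mathcal{N}\left( \mu_m, \Lambda_m^{-1} \right) \tag{E.3} \]

$x_n^{(T)}$ is the $D^{(T)} \times 1$ target feature vector
$x_n^{(C)}$ is the $D^{(C)} \times 1$ contextual feature vector
$t_n$ is the binary class label
$c_n$ is the binary-coded latent variable
$n = 1, 2, \ldots, N$ is the data index
$m = 1, 2, \ldots, T$ is the mixture component index
$d = 1, 2, \ldots, D^{(C)}$ or $D^{(T)}$ is the dimension index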

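E.2 Priors

\[ (\mu_m, \Lambda_m) \sim \mathcal{N}_{D^{(C)}}\left( \mu_m \mid \mu_0, (u_0 \Lambda_m)^{-1} \right) \mathcal{W}\left( \Lambda_m \mid B_0, \nu_0 \right) \tag{E.4} \]

\[ w_m \sim \mathcal{N}_{D^{(T)}}\left( 0, \mathrm{diag}(\beta_m)^{-1} \right) \tag{E.5} \]

\[ \beta_{md} \sim \mathrm{Gamma}(a_0, b_0) \tag{E.6} \]

\[ c_n \sim \mathrm{Multinomial}(\pi) \tag{E.7} \]

\[ \pi_m = v_m \prod_{l < m} (1 - v_l) \tag{E.8} \]

\[ v_m \sim \mathrm{Beta}(1, \alpha) \tag{E.9} \]

\[ \alpha \sim \mathrm{Gamma}(\tau_{10}, \tau_{20}) \tag{E.10} \]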
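The stick-breaking construction in (E.8) turns a sequence of Beta-distributed fractions into mixture weights. A minimal sketch is given below; the function name `stick_breaking_weights` and the example fractions are hypothetical.

```python
def stick_breaking_weights(v):
    """Map stick fractions v_1..v_T to weights pi_m = v_m * prod_{l<m} (1 - v_l)."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for v_m in v:
        weights.append(v_m * remaining)
        remaining *= 1.0 - v_m
    return weights

# Illustrative fractions; setting the last one to 1 truncates the process
# so that the weights sum to one.
weights = stick_breaking_weights([0.5, 0.5, 1.0])
```

Setting the final fraction to one is the usual truncation device: whatever stick remains is assigned to the last component.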

E.3  Model  Likelihood 

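The joint likelihood of the labels and context features, given all model parameters, is

\[ p(t, X^{(C)} \mid -) = \prod_{n=1}^{N} \prod_{m=1}^{T} \left[ \sigma(y_{nm})^{t_n} \left[ 1 - \sigma(y_{nm}) \right]^{1 - t_n} \mathcal{N}_{D^{(C)}}\left( x_n^{(C)} \mid \mu_m, \Lambda_m^{-1} \right) \right]^{c_{nm}} \tag{E.11} \]

and the corresponding log-likelihood is:

\[
\begin{aligned}
\log p(t, X^{(C)} \mid -) = \sum_{n=1}^{N} \sum_{m=1}^{T} c_{nm} \Big\{ & t_n \log \sigma(y_{nm}) + (1 - t_n) \log\left[ 1 - \sigma(y_{nm}) \right] \\
& - \frac{1}{2} \Big[ D^{(C)} \log 2\pi + \log \left| \Lambda_m^{-1} \right| + \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) \Big] \Big\}
\end{aligned}
\tag{E.12}
\]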


E.4 Variational Posterior on μ and Λ

It was derived in Section E.10 that the NFE is maximized by a variational posterior,
$q(\mu, \Lambda)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\mu, \Lambda) \propto \langle \log p(\mu, \Lambda \mid -) \rangle \tag{E.13} \]

The true log-posterior may be calculated from Bayes' theorem:

\[
\begin{aligned}
\log p(\mu, \Lambda \mid -) ={}& \log p(t \mid X^{(T)}, W, -) + \log p(X^{(C)} \mid \mu, \Lambda, -) \\
& + \log p(\mu \mid \Lambda) + \log p(\Lambda) - K,
\end{aligned}
\tag{E.14}
\]

where $K$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $\mu$ and $\Lambda$, then taking the variational
expectation $\langle \cdot \rangle$:

$\log p(\mu, \Lambda \mid -)$


N  T 


VVc  ur\  u  —  2ut A  x{c)  +  x(c)  A  x^c) 


n=l  m=l 
T 


—  -  E  -  2/^w0Amra0  +  morMoAmm0] 


ra=l 

T 

ra=l 

T 

--E 

2  ^ 

ra=l 


AT 


Tr  (B0  1  Am)  -  t'o  +  E  Cnm  -  J°(G)  _  1  log  I 


71=1 


-  A' 


^MmAm  ^OPo  d~  ^  ^  Cnm.Xn  j  +  Mm  fe  Cnm  ~h  A ■mfJ'r 


N 


P m^mAm  p  1n  P / n  ^  < < i  A rn  P, , ,  “1“  ^  ^  CnmX,^  '*  AmXn  ^  "h  ^Iq  ^oAmTUo  "h  Tl  (Bq  A;n  j 


n=l 


TV 


-  ,o  +  E Cnm  _  jD(C)  _  1  log  i A' 


n=l 


-  A' 


(E.15) 


Completing the square in the first two terms, and then using the identity
$a^T B a = \mathrm{Tr}(a a^T B)$, yields:

$\log p(\mu, \Lambda \mid -)$


m= 1 


Pm)  ^m-^m  (Pr 


Tr  (umMmAm)  +  Tr  (CmAm) 


+  Tr  (mqM0A0)  +  Tr  (B0  1  Am) 


+ 


N 

^  ^  Cnm 
n=  1 


D ^  log  |  Ar 


-K 

(E.16) 


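where

\[ u_m = u_0 + \sum_{n=1}^{N} c_{nm} \tag{E.17} \]

\[ \hat{\mu}_m = \frac{u_0 \mu_0 + \sum_{n=1}^{N} c_{nm} x_n^{(C)}}{u_m} \tag{E.18} \]

\[ M_m = \hat{\mu}_m \hat{\mu}_m^T \tag{E.19} \]

\[ C_m = \sum_{n=1}^{N} c_{nm} x_n^{(C)} x_n^{(C)T} \tag{E.20} \]

\[ M_0 = \mu_0 \mu_0^T \tag{E.21} \]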
Consolidating the last four terms using the identity
$\mathrm{Tr}(A) + \mathrm{Tr}(B) = \mathrm{Tr}(A + B)$ yields:

$\log p(\mu, \Lambda \mid -)$


'y  [  {Pm  Pm)  umA-m  {Pm  Pm)  "b  Tt^CmAm  MmMmA 


UoMqAo  +  B0  1  Am)  -  (  VQ  +  Y  cnm  -  D[C)  -  1  log  |  Am|  -  K 


2  'y  !  ( Pm  Pm)  um^-m  ( Pm  Pm) 
m=  1 

+  Tr  [(Cm  —  umMm  +  mqMq  +  B0  1)  Am] 


Z'o  +  Y  Cnm  -  D{c)  —  1  )  log  |  Am|  -  K 


(E.22) 


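Consolidating terms reveals that $\mu$ and $\Lambda$ are Normal-Wishart:

\[ \log p(\mu, \Lambda \mid -) = \log \left[ \mathcal{N}\left( \mu_m \mid \hat{\mu}_m, (u_m \Lambda_m)^{-1} \right) \mathcal{W}\left( \Lambda_m \mid B_m, \nu_m \right) \right] \tag{E.23} \]

where $\hat{\mu}_m$ and $u_m$ are defined above, and

\[ \nu_m = \nu_0 + \sum_{n=1}^{N} c_{nm} \tag{E.24} \]

\[ B_m = \left( C_m - u_m M_m + u_0 M_0 + B_0^{-1} \right)^{-1} \tag{E.25} \]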


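Useful moments in VB updates for other model parameters:

\[ \langle \mu_m \rangle = \hat{\mu}_m \tag{E.26} \]

\[ \langle \Lambda_m \rangle = \nu_m B_m \tag{E.27} \]

\[ \langle \mu_m \mu_m^T \rangle = \hat{\mu}_m \hat{\mu}_m^T + u_m^{-1} \nu_m^{-1} B_m^{-1} \tag{E.28} \]

\[ \langle \log |\Lambda_m| \rangle = \sum_{d=1}^{D^{(C)}} \psi\left( \frac{\nu_m + 1 - d}{2} \right) + D^{(C)} \log 2 + \log |B_m| \tag{E.29} \]

\[ \left\langle \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) \right\rangle = \nu_m \left( x_n^{(C)} - \hat{\mu}_m \right)^T B_m \left( x_n^{(C)} - \hat{\mu}_m \right) + \frac{D^{(C)}}{u_m} \tag{E.30} \]

where $\psi(\cdot)$ denotes the digamma function.

E.5 Variational Posterior on w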

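It was derived previously that the NFE is maximized by a variational posterior,
$q(W)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(W) \propto \langle \log p(W \mid -) \rangle \tag{E.31} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(W \mid -) = \log p(t \mid W, X^{(T)}, -) + \log p(X^{(C)} \mid -) + \log p(W) - K, \tag{E.32} \]

where $K$ denotes a normalizing constant.

Because the binomial distribution on $t$ does not offer conjugate updating for our
choice of the prior on $w$, we impose a lower-bound approximation to $p(t_n \mid w_m, x_n^{(T)})$:

\[
\begin{aligned}
p(t_n \mid w_m) &= \sigma(y_{nm})^{t_n} \left[ 1 - \sigma(y_{nm}) \right]^{1 - t_n} \\
&\geq \sigma(\xi_{nm}) \exp\left( \frac{\gamma_{nm} - \xi_{nm}}{2} - \lambda(\xi_{nm}) \left( \gamma_{nm}^2 - \xi_{nm}^2 \right) \right)
\end{aligned}
\tag{E.33}
\]

where $\xi_{nm}$ is a variational parameter and

\[ \gamma_{nm} = (2 t_n - 1) \, y_{nm} \tag{E.34} \]

\[ \lambda(\xi_{nm}) = \frac{1}{4 \xi_{nm}} \tanh\left( \frac{\xi_{nm}}{2} \right) \tag{E.35} \]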
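The bound in (E.33) can be checked numerically. The sketch below is an illustration, not part of the original derivation; the function names `sigmoid`, `jj_lambda`, and `jj_lower_bound` are hypothetical, and the argument `gamma` plays the role of $(2t_n - 1) y_{nm}$.

```python
import math

def sigmoid(y):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-y))

def jj_lambda(xi):
    """lambda(xi) = tanh(xi / 2) / (4 xi), defined here for xi > 0."""
    return math.tanh(xi / 2.0) / (4.0 * xi)

def jj_lower_bound(gamma, xi):
    """Quadratic-exponent lower bound on sigmoid(gamma), as in (E.33)."""
    exponent = (gamma - xi) / 2.0 - jj_lambda(xi) * (gamma ** 2 - xi ** 2)
    return sigmoid(xi) * math.exp(exponent)

# Largest amount by which the bound exceeds the sigmoid on a small grid;
# for a valid lower bound this should not be positive beyond rounding.
worst_violation = max(
    jj_lower_bound(g / 2.0, xi) - sigmoid(g / 2.0)
    for g in range(-8, 9)
    for xi in (0.5, 1.0, 2.0, 4.0)
)
```

The bound touches the sigmoid at `gamma == xi` and at `gamma == -xi`, which is why the update for the variational parameter matches it to the expected squared activation.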
Using the approximation for $p(t_n \mid w_m, x_n^{(T)})$, the variational posterior can be
calculated by solving for $p(W \mid -)$ as a function of $W$, then taking the variational
expectation $\langle \cdot \rangle$:

$\log p(W \mid -)$


n= 1  m=  1 


^  ^  ^  ^  Cnm  (  ~  (7nra  £gfvrn)  ^  ((nm) 


\  lQg^C?l  -  X  (Xn  ;)  - 


-  x  X/  [^®g^  +  IogW+  w„Awm]  -  K 


"  Wm  AWm  +  ^  ^  cnm  (2A  (£nm)  7nm  7 nm) 


”  9  5Z  WmAwm  +  Cnm  (2A  (&*"*)  WmXn  ’x^  Wm  ~  (2t„  -  1)  W^X^ 


i  T  r  /i  * 

-  ^  -2w£  I  -  ^  cnm  (2fn  -  1)  xf) 


Wm  (  A  +  2  cnmA  (fnm)  Xnx|(r)  w 


(E.36) 


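Completing the square reveals that $W$ is Gaussian:

\[ \log p(W \mid -) = \log \mathcal{N}\left( w_m \mid \omega_m, \Sigma_m \right) \tag{E.37} \]

where

\[ \omega_m = \frac{1}{2} \Sigma_m \sum_{n=1}^{N} c_{nm} (2 t_n - 1) \, x_n^{(T)} \tag{E.38} \]

\[ \Sigma_m = \left( A + 2 \sum_{n=1}^{N} c_{nm} \lambda(\xi_{nm}) \, x_n^{(T)} x_n^{(T)T} \right)^{-1} \tag{E.39} \]

Useful moments in VB updates for other model parameters:

\[ \langle w_m \rangle = \omega_m \tag{E.40} \]

\[ \langle w_m w_m^T \rangle = \omega_m \omega_m^T + \Sigma_m \tag{E.41} \]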

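E.6 Variational Posterior on ξ

The updates for the variational parameter $\xi$ are derived by directly optimizing the
Negative Free Energy:

\[
\begin{aligned}
\mathcal{F} ={}& \langle \log p(t, X^{(C)} \mid -) \rangle - \mathrm{KLD}[q(\mu, \Lambda) \| p(\mu, \Lambda)] - \mathrm{KLD}[q(W) \| p(W)] \\
& - \mathrm{KLD}[q(A) \| p(A)] - \mathrm{KLD}[q(Z) \| p(Z)] - \mathrm{KLD}[q(v) \| p(v)] - \mathrm{KLD}[q(\alpha) \| p(\alpha)]
\end{aligned}
\tag{E.42}
\]

\[ \frac{\partial \mathcal{F}}{\partial \xi} = \frac{\partial}{\partial \xi} \left\langle \log p\left( t, X^{(C)} \mid W = w_m, x_n^{(T)} \right) \right\rangle \tag{E.43} \]

Substituting the approximation for $p(t_n \mid w_m, x_n^{(T)})$: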


<9£ 

d£ 


T  D(t) 

■EE 

m=l  d=l  L 
T  Z)(T) 

EE 

m=l  d=l 


1  - .  <9A(£md) 


1  +  2 


0  T  ^md,X{imd) 


d£,md 


((7 D  Cmrf) 


e-€md/2  ^  lg^md/2  _  lg  £md/2  ]_  <9A  (£mrf) 

2  <9£mci 


+ 


gCmd/2  _J_  g— Cmd/2  _|_  g-£md/2 


((7md)  -  £md) 


T  AT) 

EE 

m=l  d=l 
T  D^T) 


IgCmd/2  +  le  U/2  _  1  _  dA(£mrf) 

gCmd/2  g-5md/2  2  <9£md 


\  \  ^  9\(C,mcj)  /,  2  \  f2  \ 

/  ^  nc  \\imd/  S md 1 

m=l  <f=l 


((7D  -  £L) 


(E.44) 


Because the derivative of $\lambda(\xi_{nm})$ is purely negative, $\mathcal{F}$ is maximized at

\[ \xi_{nm}^2 = \langle \gamma_{nm}^2 \rangle = x_n^{(T)T} \langle w_m w_m^T \rangle x_n^{(T)} \tag{E.45} \]
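Combining (E.45) with the second moment of a Gaussian posterior on the weights, the update reduces to a projected mean plus a quadratic form. The following sketch assumes that second moment has the form `omega omega^T + sigma`; the names `update_xi`, `omega`, and `sigma` are hypothetical, and plain nested lists stand in for matrices.

```python
import math

def update_xi(x, omega, sigma):
    """Return xi >= 0 with xi^2 = x^T (omega omega^T + sigma) x."""
    dim = len(x)
    # (omega . x)^2 is the contribution of the posterior mean.
    mean_proj = sum(w_i * x_i for w_i, x_i in zip(omega, x))
    # x^T sigma x is the contribution of the posterior covariance.
    quad = sum(x[i] * sigma[i][j] * x[j] for i in range(dim) for j in range(dim))
    return math.sqrt(mean_proj ** 2 + quad)

# Illustrative two-dimensional feature vector and posterior.
xi = update_xi([1.0, 2.0], [0.5, -0.25], [[1.0, 0.0], [0.0, 0.25]])
```

Only the positive root is returned because the bound depends on the variational parameter through its square and through an even function of it.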
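E.7 Variational Posterior on β

It was derived previously that the NFE is maximized by a variational posterior,
$q(\beta)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\beta) \propto \langle \log p(\beta \mid -) \rangle \tag{E.46} \]

The true posterior may be calculated from Bayes' theorem:

\[ p(\beta \mid -) \propto p(W \mid \beta) \, p(\beta) \tag{E.47} \]

The variational posterior can be calculated by solving the true posterior as a function
of $\beta$, then taking the variational expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
p(\beta \mid -) &\propto \prod_{m=1}^{T} \prod_{d=1}^{D^{(T)}} \mathcal{N}\left( w_{md} \mid 0, \beta_{md}^{-1} \right) \mathrm{Gamma}\left( \beta_{md} \mid a_0, b_0 \right) \\
&\propto \prod_{m=1}^{T} \prod_{d=1}^{D^{(T)}} \beta_{md}^{\frac{1}{2}} \exp\left( -\frac{\beta_{md} w_{md}^2}{2} \right) \frac{b_0^{a_0}}{\Gamma(a_0)} \beta_{md}^{a_0 - 1} \exp\left( -b_0 \beta_{md} \right) \\
&\propto \prod_{m=1}^{T} \prod_{d=1}^{D^{(T)}} \beta_{md}^{a_0 + \frac{1}{2} - 1} \exp\left( -\beta_{md} \left[ b_0 + \frac{w_{md}^2}{2} \right] \right)
\end{aligned}
\tag{E.48}
\]

Therefore, the $\beta$'s are Gamma distributed:

\[ p(\beta \mid -) = \prod_{m=1}^{T} \prod_{d=1}^{D^{(T)}} \mathrm{Gamma}\left( \beta_{md} \mid a_{md}, b_{md} \right) \tag{E.49} \]

where

\[ a_{md} = a_0 + \frac{1}{2} \tag{E.50} \]

\[ b_{md} = b_0 + \frac{w_{md}^2}{2} \tag{E.51} \]

Useful moments in VB updates for other model parameters:

\[ \langle \beta_{md} \rangle = \frac{a_{md}}{b_{md}} \tag{E.52} \]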
E.8 Variational Posterior on v

It was derived previously that the NFE is maximized by a variational posterior,
$q(v)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(v) \propto \langle \log p(v_m \mid -) \rangle \tag{E.53} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(v_m \mid -) = \log p(C \mid v_m) + \log p(v_m) - K, \tag{E.54} \]

where $K$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $v$, then taking the variational
expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(v_m \mid -) &= \sum_{n=1}^{N} c_{nm} \log v_m + \sum_{n=1}^{N} \sum_{l > m} c_{nl} \log(1 - v_m) + (\alpha - 1) \log(1 - v_m) - K \\
&= \sum_{n=1}^{N} c_{nm} \log v_m + \left( \alpha + \sum_{n=1}^{N} \sum_{l > m} c_{nl} - 1 \right) \log(1 - v_m) - K
\end{aligned}
\tag{E.55}
\]

Therefore,

\[ p(v_m \mid -) = \mathrm{Beta}(\nu_{m1}, \nu_{m2}) \tag{E.56} \]

where

\[ \nu_{m1} = 1 + \sum_{n=1}^{N} c_{nm} \tag{E.57} \]

\[ \nu_{m2} = \alpha + \sum_{n=1}^{N} \sum_{l > m} c_{nl} \tag{E.58} \]

Useful moments in VB updates for other model parameters:

\[ \langle \ln v_m \rangle = \psi(\nu_{m1}) - \psi(\nu_{m1} + \nu_{m2}) \tag{E.59} \]

\[ \langle \ln(1 - v_m) \rangle = \psi(\nu_{m2}) - \psi(\nu_{m1} + \nu_{m2}) \tag{E.60} \]


E.9 Variational Posterior on α

It was derived previously that the NFE is maximized by a variational posterior,
$q(\alpha)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(\alpha) \propto \langle \log p(\alpha \mid -) \rangle \tag{E.61} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(\alpha \mid -) = \log p(v \mid \alpha) + \log p(\alpha) - K, \tag{E.62} \]

where $K$ denotes a normalizing constant. The variational posterior can be calculated
by solving the true log-posterior as a function of $\alpha$, then taking the variational
expectation $\langle \cdot \rangle$:

\[
\begin{aligned}
\log p(\alpha \mid -) &= \sum_{m=1}^{T-1} \left[ \log \alpha + (\alpha - 1) \log(1 - v_m) \right] + (\tau_{10} - 1) \log \alpha - \tau_{20} \alpha - K \\
&= \alpha \left[ -\tau_{20} + \sum_{m=1}^{T-1} \log(1 - v_m) \right] + \left( \tau_{10} + T - 2 \right) \log \alpha - K
\end{aligned}
\tag{E.63}
\]

Therefore,

\[ p(\alpha \mid -) = \mathrm{Gamma}(\tau_1, \tau_2) \tag{E.64} \]

\[ \tau_1 = \tau_{10} + T - 1 \tag{E.65} \]

\[ \tau_2 = \tau_{20} - \sum_{m=1}^{T-1} \log(1 - v_m) \tag{E.66} \]

Useful moments in VB updates for other model parameters:

\[ \langle \alpha \rangle = \frac{\tau_1}{\tau_2} \tag{E.67} \]
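The Gamma update for the concentration parameter needs only the expected log of the remaining stick lengths. A minimal sketch follows; the name `alpha_gamma_update` is hypothetical, and the input list is assumed to hold the expectations of $\log(1 - v_m)$ for the first T - 1 sticks.

```python
def alpha_gamma_update(elog_one_minus_v, tau10, tau20):
    """Return (tau_1, tau_2, <alpha>) as in (E.65)-(E.67)."""
    # One entry per stick m = 1..T-1, so the list length equals T - 1.
    tau1 = tau10 + len(elog_one_minus_v)
    # Each expected log is negative, so subtracting the sum increases tau_2.
    tau2 = tau20 - sum(elog_one_minus_v)
    return tau1, tau2, tau1 / tau2

# Illustrative expectations for a truncation level of T = 3.
tau1, tau2, alpha_mean = alpha_gamma_update([-0.5, -1.5], 1.0, 1.0)
```

When the sticks take most of the remaining mass, the expected logs are strongly negative, the rate grows, and the expected concentration shrinks, which favors fewer active components.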
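E.10 Variational Posterior on C

It was derived previously that the NFE is maximized by a variational posterior,
$q(C)$, that is proportional to the variational expectation of the true log-posterior
with respect to all other model parameters:

\[ q(C) \propto \langle \log p(C \mid -) \rangle \tag{E.68} \]

The true log-posterior may be calculated from Bayes' theorem:

\[ \log p(C \mid -) = \log p(t, X^{(C)} \mid C, -) + \log p(C) - K, \tag{E.69} \]

where $K$ denotes a normalizing constant. The posterior will also be multinomial
with parameters (responsibilities) $\phi$:

\[
\begin{aligned}
\log p(c_{nm} = 1 \mid -) ={}& \log \phi_{nm} \\
\propto{}& \log \sigma(\xi_{nm}) + \frac{1}{2} \left( \gamma_{nm} - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( \gamma_{nm}^2 - \xi_{nm}^2 \right) - \frac{D^{(C)}}{2} \log 2\pi + \frac{1}{2} \log |\Lambda_m| \\
& - \frac{1}{2} \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) + \log v_m + \sum_{l < m} \log(1 - v_l) \\
\propto{}& \log \sigma(\xi_{nm}) + \frac{1}{2} \left( (2 t_n - 1) \, w_m^T x_n^{(T)} - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( x_n^{(T)T} w_m w_m^T x_n^{(T)} - \xi_{nm}^2 \right) \\
& + \frac{1}{2} \log |\Lambda_m| - \frac{1}{2} \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) + \log v_m + \sum_{l < m} \log(1 - v_l)
\end{aligned}
\tag{E.70}
\]

Therefore,

\[ p(c_n \mid -) = \mathrm{Multinomial}(\phi_n) \tag{E.71} \]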
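Because (E.70) gives the responsibilities only up to an additive constant in log space, they must be normalized across components for each observation. The sketch below shows one numerically stable way to do this; the function name `normalize_log_responsibilities` is hypothetical.

```python
import math

def normalize_log_responsibilities(log_phi):
    """Turn unnormalized log-responsibilities for one observation into probabilities."""
    # Subtract the maximum so the largest exponent is exactly zero.
    top = max(log_phi)
    unnorm = [math.exp(value - top) for value in log_phi]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Illustrative inputs far below the range where exp() would underflow naively.
phi = normalize_log_responsibilities([-1000.0, -1001.0])
```

Exponentiating the raw values in this example would give zero for both components, while the shifted version preserves their ratio.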


Useful  moments  in  VB  updates  for  other  model  parameters: 




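E.11 Negative Free Energy

The NFE can be expressed as the difference between the expected log-likelihood and
the Kullback-Leibler divergence (KLD) between the variational posteriors and the
priors:

\[
\begin{aligned}
\mathcal{F} ={}& \langle \log p(t, X^{(C)} \mid -) \rangle - \mathrm{KLD}\left[ q(\mu, \Lambda, W, \beta, C, v, \alpha) \,\|\, p(\mu, \Lambda, W, \beta, C, v, \alpha) \right] \\
={}& \langle \log p(t \mid W, X^{(T)}) \rangle + \langle \log p(X^{(C)} \mid \mu, \Lambda, C) \rangle - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(w_m) \,\|\, p(w_m \mid \beta_m) \right] \\
& - \sum_{m=1}^{T} \sum_{p=1}^{P} \mathrm{KLD}\left[ q(\beta_{mp}) \,\|\, p(\beta_{mp}) \right] - \sum_{n=1}^{N} \mathrm{KLD}\left[ q(c_n) \,\|\, p(c_n \mid v) \right] \\
& - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(\mu_m \mid \Lambda_m) q(\Lambda_m) \,\|\, p(\mu_m \mid \Lambda_m) p(\Lambda_m) \right] \\
& - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(v_m) \,\|\, p(v_m) \right] - \mathrm{KLD}\left[ q(\alpha) \,\|\, p(\alpha) \right] \\
={}& \sum_{n=1}^{N} \sum_{m=1}^{T} \langle c_{nm} \rangle \left[ \log \sigma(\xi_{nm}) + \frac{1}{2} \left( \langle \gamma_{nm} \rangle - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( \langle \gamma_{nm}^2 \rangle - \xi_{nm}^2 \right) \right] \\
& - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} \langle c_{nm} \rangle \left[ D^{(C)} \log 2\pi - \langle \log |\Lambda_m| \rangle + \left\langle \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) \right\rangle \right] \\
& - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(w_m) \,\|\, p(w_m \mid \beta_m) \right] - \sum_{m=1}^{T} \sum_{p=1}^{P} \mathrm{KLD}\left[ q(\beta_{mp}) \,\|\, p(\beta_{mp}) \right] \\
& - \sum_{n=1}^{N} \mathrm{KLD}\left[ q(c_n) \,\|\, p(c_n \mid v) \right] - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(\mu_m \mid \Lambda_m) q(\Lambda_m) \,\|\, p(\mu_m \mid \Lambda_m) p(\Lambda_m) \right] \\
& - \sum_{m=1}^{T} \mathrm{KLD}\left[ q(v_m) \,\|\, p(v_m) \right] - \mathrm{KLD}\left[ q(\alpha) \,\|\, p(\alpha) \right]
\end{aligned}
\tag{E.72}
\]

Substituting $\langle \gamma_{nm} \rangle = (2 t_n - 1) \, x_n^{(T)T} \langle w_m \rangle$ and
$\langle \gamma_{nm}^2 \rangle = x_n^{(T)T} \langle w_m w_m^T \rangle x_n^{(T)}$, the expected log-likelihood terms become

\[
\begin{aligned}
& \sum_{n=1}^{N} \sum_{m=1}^{T} \langle c_{nm} \rangle \left[ \log \sigma(\xi_{nm}) + \frac{1}{2} \left( (2 t_n - 1) \, x_n^{(T)T} \langle w_m \rangle - \xi_{nm} \right) - \lambda(\xi_{nm}) \left( x_n^{(T)T} \langle w_m w_m^T \rangle x_n^{(T)} - \xi_{nm}^2 \right) \right] \\
& - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{T} \langle c_{nm} \rangle \left[ D^{(C)} \log 2\pi - \langle \log |\Lambda_m| \rangle + \left\langle \left( x_n^{(C)} - \mu_m \right)^T \Lambda_m \left( x_n^{(C)} - \mu_m \right) \right\rangle \right],
\end{aligned}
\]

where $\mathrm{KLD}[q(w_m) \| p(w_m \mid \beta_m)]$ is a KLD between two Gaussian distributions,
$\mathrm{KLD}[q(\beta_{md}) \| p(\beta_{md})]$ is a KLD between two Gamma distributions,
$\mathrm{KLD}[q(c_n) \| p(c_n \mid v)]$ is a KLD between two multinomial distributions,
$\mathrm{KLD}[q(\mu_m \mid \Lambda_m) q(\Lambda_m) \| p(\mu_m \mid \Lambda_m) p(\Lambda_m)]$ is a KLD between two Normal-Wishart
distributions, $\mathrm{KLD}[q(v_m) \| p(v_m)]$ is a KLD between two Beta distributions, and
$\mathrm{KLD}[q(\alpha) \| p(\alpha)]$ is a KLD between two Gamma distributions.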